Automatic segmentation of speech articulators from real-time midsagittal MRI based on supervised learning

Cited: 24
Authors
Labrunie, Mathieu [1 ]
Badin, Pierre [1 ]
Voit, Dirk [2 ]
Joseph, Arun A. [2 ]
Frahm, Jens [2 ]
Lamalle, Laurent [3 ]
Vilain, Coriandre [1 ]
Boe, Louis-Jean [1 ]
Affiliations
[1] Univ Grenoble Alpes, GIPSA-Lab, Grenoble INP, CNRS, F-38000 Grenoble, France
[2] Max Planck Inst Biophys Chem, Biomed NMR Forsch GmbH, Gottingen, Germany
[3] Univ Grenoble Alpes, CHU Grenoble Alpes, INSERM, CNRS, UMS IRMaGe, Inserm US 17, CNRS UMS 3552, F-38043 Grenoble, France
Keywords
Real-time MRI; Speech articulation; Articulator segmentation; Multiple Linear Regression; Active Shape Models; Shape Particle Filtering; VOCAL-TRACT; TONGUE MOVEMENTS; SHAPE; IDENTIFICATION; RESOLUTION; TRACKING; CONTOURS; MODEL; JAW; LIP;
DOI
10.1016/j.specom.2018.02.004
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Speech production mechanisms can be characterized at the peripheral level by both their acoustic and articulatory traces over time. Researchers have therefore devoted considerable effort to measuring articulation. Thanks to the spectacular progress of the last decade, real-time Magnetic Resonance Imaging (RT-MRI) now offers frame rates closer than ever to those achieved by electromagnetic articulography or ultrasound echography, while providing very detailed geometric information about the whole vocal tract. RT-MRI has thus become indispensable for studying the movements of the speech articulators. However, making efficient use of large sets of images to characterize and model speech tasks requires automatic methods that segment the articulators from these images with sufficient accuracy. The present article describes our approach to developing, based on supervised machine learning techniques, an automatic segmentation method that offers several useful features: (1) the ability to deal with individual articulators independently; (2) tracking of the hard palate, jaw and hyoid bone as rigid structures; (3) contours for a full set of articulators, including the epiglottis and the back of the larynx, which partly reflects the vocal fold abduction/adduction state; (4) more explicit, and thus more accurate, handling of contact between articulators; and (5) an accuracy better than one millimeter. The main contributions of this work are the following. We have recorded the first large database of high-quality RT-MRI midsagittal images for a French speaker. We have manually segmented the main speech articulators (jaw, lips, tongue, velum, hyoid, larynx, etc.) for a small training set of about 60 images, selected by hierarchical clustering to represent the whole corpus as faithfully as possible.
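The frame-selection step described above — hierarchical clustering of the corpus and picking one representative image per cluster — can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name, the Ward linkage choice, and the centroid-nearest representative are all our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def select_training_frames(frames, n_clusters=60):
    """frames: (n_images, n_pixels) array of vectorized midsagittal images.
    Returns sorted indices of one representative frame per cluster."""
    # Agglomerative (hierarchical) clustering of the whole corpus
    Z = linkage(frames, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    picked = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # Take the member closest to the cluster centroid as representative
        centroid = frames[members].mean(axis=0, keepdims=True)
        d = cdist(frames[members], centroid).ravel()
        picked.append(int(members[np.argmin(d)]))
    return sorted(picked)
```

Selecting cluster representatives rather than random frames keeps the small manually segmented training set faithful to the articulatory variety of the full corpus.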
We have used these data to train various image and contour models for automatic articulatory segmentation. The first method, based on Multiple Linear Regression, predicts the contour coordinates from the image pixel intensities with a Mean Sum of Distances (MSD) segmentation error over all articulators of 0.91 mm, computed with a Leave-One-Out Cross-Validation procedure on the training set. Another method, based on Shape Particle Filtering, reaches an MSD error of 0.66 mm. Finally, the modified version of Active Shape Models (mASM) explored in this study gives an MSD error of a mere 0.55 mm (0.68 mm for the tongue). These results demonstrate that the mASM approach outperforms state-of-the-art methods, though at the cost of manually segmenting the training set. The same method applied to other MRI data leads to similar errors, which testifies to its robustness. The large quantity of contour data that this automatic segmentation method can deliver opens the way to various fruitful perspectives in speech research: establishing more elaborate articulatory models, analyzing coarticulation and articulatory variability or invariance more finely, implementing machine learning methods for articulatory speaker normalization or adaptation, and illustrating adequate or prototypical articulatory gestures for applications in speech therapy and second language pronunciation training.
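The Multiple Linear Regression baseline described above amounts to regressing stacked contour coordinates on pixel intensities and scoring with leave-one-out cross-validation. The sketch below is a hedged simplification, not the paper's implementation: we use a ridge-regularized least-squares fit and a plain point-to-point distance in place of the paper's curve-based Mean Sum of Distances; all names and the `ridge` parameter are our assumptions.

```python
import numpy as np

def loo_mlr_error(images, contours, ridge=1e-3):
    """images: (n, p) pixel intensities; contours: (n, 2k) stacked (x, y)
    contour coordinates. Returns mean point-to-point error over LOO folds
    (a simplified proxy for the paper's MSD metric)."""
    n = images.shape[0]
    errors = []
    for i in range(n):
        train = np.arange(n) != i
        X, Y = images[train], contours[train]
        Xm, Ym = X.mean(0), Y.mean(0)
        Xc, Yc = X - Xm, Y - Ym
        # Ridge-regularized normal equations: W = (X^T X + lambda*I)^-1 X^T Y
        W = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(X.shape[1]), Xc.T @ Yc)
        pred = (images[i] - Xm) @ W + Ym
        # Average Euclidean distance between predicted and true contour points
        diff = pred.reshape(-1, 2) - contours[i].reshape(-1, 2)
        errors.append(np.linalg.norm(diff, axis=1).mean())
    return float(np.mean(errors))
```

Because each fold refits the regression without the held-out image, the reported error reflects generalization to unseen frames rather than training fit, which is why the paper quotes LOOCV figures.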
Pages: 27-46 (20 pages)
Related papers
50 records
  • [31] Dynamic off-resonance correction for spiral real-time MRI of speech
    Lim, Yongwan
    Lingala, Sajan Goud
    Narayanan, Shrikanth S.
    Nayak, Krishna S.
    MAGNETIC RESONANCE IN MEDICINE, 2019, 81 (01) : 234 - 246
  • [32] Automatic Data-Driven Learning of Articulatory Primitives from Real-Time MRI Data using Convolutive NMF with Sparseness Constraints
    Ramanarayanan, Vikram
    Katsamanis, Athanasios
    Narayanan, Shrikanth
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 68 - 71
  • [33] Real-Time Dynamic SLAM Using Moving Probability Based on IMU and Segmentation
    Zhang, Hanxuan
    Wang, Dingyi
    Huo, Ju
    IEEE SENSORS JOURNAL, 2024, 24 (07) : 10878 - 10891
  • [34] Real-Time Dynamic Background Segmentation Based on a Statistical Approach
    Peng, Jian-Wen
    Horng, Wen-Bing
    2009 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, SENSING AND CONTROL, VOLS 1 AND 2, 2009, : 398 - +
  • [35] REAL-TIME VIDEO OBJECT SEGMENTATION ALGORITHM BASED ON CHANGE DETECTION AND BACKGROUND UPDATING
    Chen, Tsong-Yi
    Chen, Thou-Ho
    Wang, Da-Jinn
    Chiou, Yung-Chuen
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2009, 5 (07): : 1797 - 1810
  • [36] Real-time speech MRI datasets with corresponding articulator ground-truth segmentations
    Ruthven, Matthieu
    Peplinski, Agnieszka M.
    Adams, David M.
    King, Andrew P.
    Miquel, Marc Eric
    SCIENTIFIC DATA, 2023, 10 (01)
  • [37] Feasibility of through-time spiral generalized autocalibrating partial parallel acquisition for low latency accelerated real-time MRI of speech
    Lingala, Sajan Goud
    Zhu, Yinghua
    Lim, Yongwan
    Toutios, Asterios
    Ji, Yunhua
    Lo, Wei-Ching
    Seiberlich, Nicole
    Narayanan, Shrikanth
    Nayak, Krishna S.
    MAGNETIC RESONANCE IN MEDICINE, 2017, 78 (06) : 2275 - 2282
  • [38] Tunable and real-time automatic interventional x-ray collimation from semi-supervised deep feature extraction
    Lee, Brian C.
    Rijhwani, Damini
    Lang, Sydney
    van Oorde-Grainger, Shaun
    Haak, Alexander
    Bleise, Carlos
    Lylyk, Pedro
    Ruijters, Daniel
    Sinha, Ayushi
    MEDICAL PHYSICS, 2025, 52 (03) : 1372 - 1389
  • [39] A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images
    Lim, Yongwan
    Toutios, Asterios
    Bliesener, Yannick
    Tian, Ye
    Lingala, Sajan Goud
    Vaz, Colin
    Sorensen, Tanner
    Oh, Miran
    Harper, Sarah
    Chen, Weiyi
    Lee, Yoonjeong
    Toger, Johannes
    Monteserin, Mairym Llorens
    Smith, Caitlin
    Godinez, Bianca
    Goldstein, Louis
    Byrd, Dani
    Nayak, Krishna S.
    Narayanan, Shrikanth S.
    SCIENTIFIC DATA, 2021, 8 (01)
  • [40] Automated machine learning based speech classification for hearing aid applications and its real-time implementation on smartphone
    Bhat, Gautam Shreedhar
    Shankar, Nikhil
    Panahi, Issa M. S.
    42ND ANNUAL INTERNATIONAL CONFERENCES OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY: ENABLING INNOVATIVE TECHNOLOGIES FOR GLOBAL HEALTHCARE EMBC'20, 2020, : 956 - 959