Speech Synthesis from Articulatory Movements Recorded by Real-time MRI

Cited by: 4
Authors
Otani, Yuto [1 ]
Sawada, Shun [1 ]
Ohmura, Hidefumi [1 ]
Katsurada, Kouichi [1 ]
Affiliations
[1] Tokyo Univ Sci, Dept Informat Sci, Tokyo, Japan
Source
INTERSPEECH 2023 | 2023
Keywords
real-time MRI; articulatory movement; speech synthesis; speech waveform generation; ELECTROMAGNETIC ARTICULOGRAPHY; EXTRACTION; TRACKING; DATABASE; NOISE;
DOI
10.21437/Interspeech.2023-286
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206 ; 082403 ;
Abstract
Previous speech synthesis models driven by articulatory movements recorded with real-time MRI (rtMRI) predicted only vocal tract shape parameters and required additional pitch information to generate a speech waveform. This study proposes a two-stage deep learning model consisting of a CNN-BiLSTM network that predicts a mel-spectrogram from an rtMRI video and a HiFi-GAN vocoder that synthesizes the speech waveform. We evaluated the model on two databases: the ATR 503 sentences rtMRI database and the USC-TIMIT database. On the ATR 503 sentences rtMRI database, the model achieves a PESQ score of 1.64 and an F0 RMSE of 26.7 Hz, demonstrating that all acoustic parameters, including the fundamental frequency, can be estimated from rtMRI videos. On the USC-TIMIT database, we obtained a good PESQ score and F0 RMSE; however, the synthesized speech is unclear, indicating that dataset quality affects the intelligibility of the synthesized speech.
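The first stage described in the abstract (a CNN frame encoder followed by a BiLSTM that outputs mel-spectrogram frames) can be sketched as below. This is a minimal, hypothetical PyTorch reconstruction for illustration only: the class name, layer sizes, image resolution, and mel dimensionality are assumptions, not the authors' published configuration, and the HiFi-GAN vocoder stage is omitted.

```python
import torch
import torch.nn as nn

class RtMRIToMel(nn.Module):
    """Hypothetical sketch of the paper's first stage: a CNN encodes each
    rtMRI frame independently, and a BiLSTM maps the resulting frame
    sequence to mel-spectrogram frames. All sizes are illustrative."""

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Per-frame CNN encoder for a single-channel MRI image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),  # -> 32*4*4 = 512
        )
        # Bidirectional LSTM over the frame sequence.
        self.bilstm = nn.LSTM(512, hidden, batch_first=True,
                              bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, video):  # video: (batch, time, 1, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)
        out, _ = self.bilstm(feats)
        return self.proj(out)  # (batch, time, n_mels)

model = RtMRIToMel()
mel = model(torch.randn(2, 10, 1, 64, 64))  # 10 rtMRI frames at 64x64
print(tuple(mel.shape))
```

In the full system, the predicted mel-spectrogram would then be passed to a pretrained HiFi-GAN vocoder to synthesize the waveform; here one mel frame is produced per video frame, so any mismatch with the vocoder's frame rate would need resampling.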
Pages: 127-131
Page count: 5