Applying DNN Adaptation to Reduce the Session Dependency of Ultrasound Tongue Imaging-based Silent Speech Interfaces

Cited: 0
Authors
Gosztolya, Gabor [1 ,2 ,3 ]
Grosz, Tamas [3 ,4 ]
Toth, Laszlo [3 ]
Marko, Alexandra [5 ,7 ]
Csapo, Tamas Gabor [6 ,7 ]
Affiliations
[1] Hungarian Acad Sci, MTA SZTE Res Grp Artificial Intelligence, Tisza Lajos Krt 103, H-6720 Szeged, Hungary
[2] Univ Szeged, Tisza Lajos Krt 103, H-6720 Szeged, Hungary
[3] Univ Szeged, Inst Informat, Arpad Ter 2, H-6720 Szeged, Hungary
[4] Aalto Univ, Dept Signal Proc & Acoust, Otakaari 3, FI-02150 Espoo, Finland
[5] Eotvos Lorand Univ, Dept Appl Linguist & Phonet, Muzeum Krt 4-A, H-1088 Budapest, Hungary
[6] Budapest Univ Technol & Econ, Dept Telecommun & Media Informat, Magyar Tudosok Korutja 2, H-1117 Budapest, Hungary
[7] MTA ELTE Lingual Articulat Res Grp, Muzeum Krt 4-A, H-1088 Budapest, Hungary
Keywords
Silent speech interfaces; articulatory-to-acoustic mapping; session dependency; Deep Neural Networks; DNN adaptation; DEEP NEURAL-NETWORKS; RECOGNITION;
DOI
Not available
Chinese Library Classification
T [Industrial Technology]
Discipline Code
08
Abstract
Silent Speech Interfaces (SSI) perform articulatory-to-acoustic mapping to convert articulatory movement into synthesized speech. Their main goal is to aid the speech impaired, or to serve as part of a communication system operating in environments where silence is required or where background noise is high. Although many previous studies have addressed the speaker dependency of SSI models, session dependency is also an important issue, owing to the possible misalignment of the recording equipment; for tongue ultrasound recordings in particular, no solutions are currently available. In this study, we investigate the degree of session dependency of standard feed-forward DNN-based models for ultrasound-based SSI systems. Besides examining the amount of training data required for speech synthesis parameter estimation, we also show that DNN adaptation can be useful for handling session dependency. Our results indicate that, with adaptation, less training data and training time are needed to achieve the same speech quality as training a new DNN from scratch. Our experiments also suggest that the sub-optimal cross-session behavior is caused by the misalignment of the recording equipment, as adapting only the lower, feature-extractor layers of the neural network proved sufficient to achieve a comparable level of performance.
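The adaptation strategy the abstract describes — updating only the lower, feature-extractor layers of the network on new-session data while the upper layers stay frozen — can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: the layer sizes, learning rate, and random data are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer feed-forward net: a lower "feature extractor" layer
# (assumed session-dependent) and an upper regression head that maps
# hidden features to speech-synthesis parameters (kept frozen).
W_low = rng.normal(scale=0.1, size=(16, 8))   # lower layer: adapted per session
W_up = rng.normal(scale=0.1, size=(8, 4))     # upper layer: frozen during adaptation

def forward(x):
    h = np.maximum(0.0, x @ W_low)            # ReLU hidden features
    return h @ W_up                           # predicted synthesis parameters

def adapt_lower_layer(x, y, lr=0.01):
    """One squared-error gradient step on W_low only; W_up is untouched."""
    global W_low
    h_pre = x @ W_low
    h = np.maximum(0.0, h_pre)
    err = h @ W_up - y                        # prediction error (grad up to a constant)
    grad_h = (err @ W_up.T) * (h_pre > 0)     # backprop through the ReLU
    W_low -= lr * x.T @ grad_h                # update the lower layer only

x = rng.normal(size=(32, 16))                 # stand-in: new-session ultrasound features
y = rng.normal(size=(32, 4))                  # stand-in: target synthesis parameters

W_up_before = W_up.copy()
loss_before = np.mean((forward(x) - y) ** 2)
for _ in range(50):
    adapt_lower_layer(x, y)
loss_after = np.mean((forward(x) - y) ** 2)

assert np.array_equal(W_up, W_up_before)      # frozen head really is unchanged
assert loss_after < loss_before               # adaptation reduced the fitting error
```

The design choice mirrors the paper's finding: if cross-session degradation comes from equipment misalignment (i.e., a shifted input representation), then re-learning only the input-side feature extractor should recover most of the performance, which is far cheaper than retraining the whole network.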
Pages: 109-124
Number of pages: 16
Related Papers
12 in total
  • [1] Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
    Toth, Laszlo
    Shandiz, Amin Honarmandi
    Gosztolya, Gabor
    Csapo, Tamas Gabor
    INTERSPEECH 2023, 2023, : 1169 - 1173
  • [2] F0 ESTIMATION FOR DNN-BASED ULTRASOUND SILENT SPEECH INTERFACES
    Grosz, Tamas
    Gosztolya, Gabor
    Toth, Laszlo
    Csapo, Tamas Gabor
    Marko, Alexandra
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 291 - 295
  • [3] SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks
    Kimura, Naoki
    Kono, Michinari
    Rekimoto, Jun
    CHI 2019: PROCEEDINGS OF THE 2019 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, 2019,
  • [4] DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface
    Csapo, Tamas Gabor
    Grosz, Tamas
    Gosztolya, Gabor
    Toth, Laszlo
    Marko, Alexandra
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3672 - 3676
  • [5] Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
    Shandiz, Amin Honarmandi
    Toth, Laszlo
    Gosztolya, Gabor
    Marko, Alexandra
    Csapo, Tamas Gabor
    INTERSPEECH 2021, 2021, : 1932 - 1936
  • [6] DNN-based Acoustic-to-Articulatory Inversion using Ultrasound Tongue Imaging
    Porras, Dagoberto
    Sepulveda-Sepulveda, Alexander
    Csapo, Tamas Gabor
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [7] Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces
    Gosztolya, Gabor
    Pinter, Adam
    Toth, Laszlo
    Grosz, Tamas
    Marko, Alexandra
    Csapo, Tamas Gabor
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [8] Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces
    Toth, Laszlo
    Gosztolya, Gabor
    Grosz, Tamas
    Marko, Alexandra
    Csapo, Tamas Gabor
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 3172 - 3176
  • [9] SSR7000: A Synchronized Corpus of Ultrasound Tongue Imaging for End-to-End Silent Speech Recognition
    Kimura, Naoki
    Su, Zixiong
    Saeki, Takaaki
    Rekimoto, Jun
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6866 - 6873
  • [10] Effects of F0 Estimation Algorithms on Ultrasound-Based Silent Speech Interfaces
    Dai, Pengyu
    Al-Radhi, Mohammed Salah
    Csapo, Tamas Gabor
    2021 INTERNATIONAL CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2021, : 47 - 51