Few-shot dysarthric speech recognition with text-to-speech data augmentation

Cited by: 3
Authors
Hermann, Enno [1 ]
Magimai-Doss, Mathew [1 ]
Affiliations
[1] Idiap Res Inst, Martigny, Switzerland
Source
INTERSPEECH 2023 | 2023
Keywords
automatic speech recognition; dysarthric speech; text-to-speech; few-shot learning;
DOI
10.21437/Interspeech.2023-2481
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Speakers with dysarthria could particularly benefit from assistive speech technology, but they are underserved by current automatic speech recognition (ASR) systems. The characteristics of dysarthric speech pose challenges for ASR, while recording large amounts of training data can be exhausting for patients. In this paper, we synthesise dysarthric speech with a FastSpeech 2-based multi-speaker text-to-speech (TTS) system for ASR data augmentation. We evaluate its few-shot capability by generating dysarthric speech from as few as 5 words of an unseen target speaker and then using it to train speaker-dependent ASR systems. The results indicate that, while the TTS output is not yet of sufficient quality, this approach could in future allow easy development of personalised acoustic models for new dysarthric speakers and domains.
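The abstract describes a three-step pipeline: adapt a multi-speaker TTS model to an unseen dysarthric speaker from a handful of enrolment words, synthesise additional in-domain utterances in that voice, and train a personalised (speaker-dependent) ASR model on the combined real and synthetic data. The sketch below illustrates that flow only; it is not the authors' code, and every name in it (Utterance, adapt_tts_few_shot, synthesise, train_speaker_dependent_asr, augment_and_train) is a hypothetical placeholder for the corresponding component.

# Illustrative sketch only (hypothetical placeholders, not the paper's code):
# few-shot TTS data augmentation for speaker-dependent dysarthric ASR.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Utterance:
    text: str
    audio: Sequence[float]   # waveform samples (placeholder representation)
    speaker: str


def adapt_tts_few_shot(tts_model, enrolment: List[Utterance]) -> None:
    """Hypothetical: adapt a FastSpeech 2-style multi-speaker TTS to an unseen
    dysarthric speaker from as few as ~5 enrolment words, e.g. by estimating a
    speaker embedding or briefly fine-tuning."""


def synthesise(tts_model, texts: List[str], speaker: str) -> List[Utterance]:
    """Hypothetical: generate speech in the target speaker's voice for new
    in-domain texts (audio left empty in this placeholder)."""
    return [Utterance(text=t, audio=[], speaker=speaker) for t in texts]


def train_speaker_dependent_asr(data: List[Utterance]):
    """Hypothetical: train a personalised acoustic model on real + synthetic data."""


def augment_and_train(tts_model, enrolment: List[Utterance], target_texts: List[str]):
    # 1. Few-shot adaptation of the TTS to the target speaker.
    adapt_tts_few_shot(tts_model, enrolment)
    # 2. Synthesise additional utterances covering the target domain.
    synthetic = synthesise(tts_model, target_texts, enrolment[0].speaker)
    # 3. Train a speaker-dependent ASR system on enrolment + synthetic speech.
    return train_speaker_dependent_asr(enrolment + synthetic)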
Pages: 156-160
Page count: 5