SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

Cited by: 16
Authors
Soleymanpour, Mohammad [1 ]
Johnson, Michael T. [1 ]
Soleymanpour, Rahim [2 ]
Berry, Jeffrey [3 ]
Affiliations
[1] Univ Kentucky, Elect & Comp Engn, Lexington, KY 40506 USA
[2] Univ Connecticut, Dept Biomed Engn, Storrs, CT 06269 USA
[3] Marquette Univ, Speech Pathol & Audiol, Milwaukee, WI 53201 USA
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Funding
US National Institutes of Health (NIH)
关键词
Dysarthria; speech recognition; Speech-To-Text; Synthesized speech; Data augmentation;
DOI
10.1109/ICASSP43922.2022.9746585
CLC Classification
O42 [Acoustics]
Subject Classification
070206; 082403
Abstract
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. Robust dysarthria-specific ASR requires sufficient training speech, which is not readily available. Recent advances in multi-speaker end-to-end Text-To-Speech (TTS) synthesis suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR system. In the synthesized speech, we add dysarthria severity level and pause insertion mechanisms to other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by a further 6.5%, demonstrating the effectiveness of adding these parameters. Audio samples are available at https://mohammadelc.github.io/SpeechGroupUKY/
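The abstract describes adding a pause insertion control to the TTS frontend so that synthesized speech mimics the interrupted phrasing of dysarthric talkers. As a rough illustration only (the function name, token, and probability scaling below are hypothetical and not taken from the paper), severity-scaled pause insertion into a phoneme sequence might look like:

```python
import random

def insert_pauses(phonemes, severity, base_prob=0.05, seed=0):
    """Insert <pause> tokens between phonemes with probability scaled by severity.

    Hypothetical sketch: a higher dysarthria severity level raises the chance
    of a pause token appearing between consecutive phonemes, which the TTS
    model can then render as silence of appropriate duration.
    """
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    prob = min(1.0, base_prob * severity)
    out = [phonemes[0]]
    for p in phonemes[1:]:
        if rng.random() < prob:
            out.append("<pause>")
        out.append(p)
    return out
```

With severity 0 the sequence passes through unchanged, while a sufficiently high severity places a pause between every phoneme pair; intermediate severities interpolate between these extremes.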
Pages: 7382-7386
Page count: 5
References
23 in total
[1] Chen, Zhehuai; Rosenberg, Andrew; Zhang, Yu; Wang, Gary; Ramabhadran, Bhuvana; Moreno, Pedro J. Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection. INTERSPEECH 2020, 2020: 556-560.
[2] Chien, Chung-Ming; Lin, Jheng-Hao; Huang, Chien-yu; Hsu, Po-chun; Lee, Hung-yi. Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 8588-8592.
[3] Duffy, J. R. Motor Speech Disorders: Substrates, Differential Diagnosis, and Management. 2005.
[4] Espana-Bonet, Cristina; Fonollosa, Jose A. R. Automatic Speech Recognition with Deep Neural Networks for Impaired Speech. Advances in Speech and Language Technologies for Iberian Languages (IberSPEECH 2016), 2016, 10077: 97-107.
[5] Freed, D. Motor Speech Disorders. 2011.
[6] Joy, Neethu Mariam; Umesh, S. Improving Acoustic Models in TORGO Dysarthric Speech Database. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2018, 26(3): 637-645.
[7] Khanal, Subash; Johnson, Michael T.; Bozorg, Narjes. Articulatory Comparison of L1 and L2 Speech for Mispronunciation Diagnosis. 2021 IEEE Spoken Language Technology Workshop (SLT), 2021: 693-697.
[8] Li, Jie. arXiv preprint arXiv:1812.01192, 2018.
[9] McAuliffe, Michael; Socolof, Michaela; Mihuc, Sarah; Wagner, Michael; Sonderegger, Morgan. Montreal Forced Aligner: Trainable Text-Speech Alignment using Kaldi. 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), 2017: 498-502.
[10] Menendez-Pidal, X. ICSLP 96 - Fourth International Conference on Spoken Language Processing, Proceedings, 1996: 1962. DOI 10.1109/ICSLP.1996.608020.