Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

Cited: 97
Authors
Li, Kun [1]
Qian, Xiaojun [1]
Meng, Helen [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
Keywords
Deep neural networks; L2 English speech; mispronunciation detection; mispronunciation diagnosis; speech recognition; PRONUNCIATION ERROR PATTERNS; UNSUPERVISED DISCOVERY; MODELS; REPRESENTATIONS; RECOGNITION; AGREEMENT
DOI
10.1109/TASLP.2016.2621675
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper investigates the use of multidistribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on extended recognition networks (ERNs). The ERNs leverage existing automatic speech recognition technology by constraining the search space to include the likely phonetic error patterns of the target words in addition to the canonical transcriptions. MDD is achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has the following issues: 1) learning the error patterns of the target words to generate the ERNs remains challenging, and phones or phone errors missing from the ERNs cannot be recognized even with well-trained acoustic models; and 2) the acoustic models and phonological rules are trained independently, so contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features combine acoustic features with the corresponding graphemes and canonical transcriptions (encoded as binary vectors). The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework, which works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER) for MDD are 4.6%, 30.5%, and 13.5%, respectively. It outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.
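The two technical ideas in the abstract (the multidistribution input encoding of the AGPM, and the FRR/FAR/DER metrics) can be made concrete with a minimal sketch. Everything below is an illustrative assumption: the grapheme and phoneme inventories, the 39-dimensional acoustic frame, the function names (one_hot, agpm_input, mdd_rates), and the exact metric definitions are not specified in this record and do not reproduce the authors' implementation.

```python
import numpy as np

# Assumed symbol inventories; the paper's actual grapheme and phone sets
# (and acoustic feature dimensionality) are not given in this record.
GRAPHEMES = list("abcdefghijklmnopqrstuvwxyz'") + ["<pad>"]      # 28 symbols
PHONEMES = ["aa", "ae", "ah", "b", "d", "iy", "s", "t", "sil"]   # truncated set

def one_hot(symbol, inventory):
    """Binary (one-hot) encoding of a symbol over its inventory."""
    vec = np.zeros(len(inventory), dtype=np.float32)
    vec[inventory.index(symbol)] = 1.0
    return vec

def agpm_input(acoustic_frame, grapheme, canonical_phone):
    """One AGPM input vector: real-valued acoustic features concatenated
    with the binary-encoded grapheme and canonical phoneme aligned to the
    same frame, per the abstract's description."""
    return np.concatenate([
        np.asarray(acoustic_frame, dtype=np.float32),  # e.g. MFCC/filterbank frame
        one_hot(grapheme, GRAPHEMES),                  # grapheme context (binary)
        one_hot(canonical_phone, PHONEMES),            # canonical transcription (binary)
    ])

def mdd_rates(true_accept, false_reject, false_accept, correct_diag, diag_error):
    """FRR, FAR, and DER under the hierarchical scoring commonly used in MDD
    work (assumed here; this record does not define the metrics):
      FRR = FR / (TA + FR)   -- correct phones wrongly flagged
      FAR = FA / (FA + TR)   -- mispronunciations missed
      DER = DE / (CD + DE)   -- detected mispronunciations misdiagnosed
    where TR = CD + DE are the correctly detected mispronunciations."""
    true_reject = correct_diag + diag_error
    frr = false_reject / (true_accept + false_reject)
    far = false_accept / (false_accept + true_reject)
    der = diag_error / true_reject
    return frr, far, der

# Example: a 39-dim frame aligned to grapheme "s" with canonical phone "s".
x = agpm_input(np.random.randn(39), "s", "s")
print(x.shape)  # (39 + 28 + 9,) -> (76,)
print(mdd_rates(9000, 430, 300, 600, 94))  # toy counts, illustrative only
```

Concatenating the binary grapheme and canonical-phone codes with every acoustic frame is what allows a single DNN to jointly learn the grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions the abstract describes, rather than training phonological rules separately from the acoustic model.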
Pages: 193-207
Page count: 15
Related Papers
50 items in total
  • [21] Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings
    Mateju, Lukas
    Cerva, Petr
    Zdansky, Jindrich
    SIGMAP: PROCEEDINGS OF THE 13TH INTERNATIONAL JOINT CONFERENCE ON E-BUSINESS AND TELECOMMUNICATIONS - VOL. 5, 2016: 45-51
  • [22] Dissecting neural computations in the human auditory pathway using deep neural networks for speech
    Li, Yuanning
    Anumanchipalli, Gopala K.
    Mohamed, Abdelrahman
    Chen, Peili
    Carney, Laurel H.
    Lu, Junfeng
    Wu, Jinsong
    Chang, Edward F.
    NATURE NEUROSCIENCE, 2023, 26 (12): 2213-2225
  • [23] Emotional Speech Recognition Using Deep Neural Networks
    Trinh Van, Loan
    Dao Thi Le, Thuy
    Le Xuan, Thanh
    Castelli, Eric
    SENSORS, 2022, 22 (04)
  • [24] SPEECH ENHANCEMENT USING MULTIPLE DEEP NEURAL NETWORKS
    Karjol, Pavan
    Kumar, Ajay M.
    Ghosh, Prasanta Kumar
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018: 5049-5053
  • [25] SPEECH ACTIVITY DETECTION IN ONLINE BROADCAST TRANSCRIPTION USING DEEP NEURAL NETWORKS AND WEIGHTED FINITE STATE TRANSDUCERS
    Mateju, Lukas
    Cerva, Petr
    Zdansky, Jindrich
    Malek, Jiri
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017: 5460-5464
  • [26] Continuous speech segmentation by L1 and L2 speakers of English: the role of syntactic and prosodic cues
    Dobrego, Aleksandra
    Konina, Alena
    Mauranen, Anna
    LANGUAGE AWARENESS, 2023, 32 (03): 487-507
  • [27] On Line Emotion Detection Using Retrainable Deep Neural Networks
    Kollias, Dimitrios
    Tagaris, Athanasios
    Stafylopatis, Andreas
    PROCEEDINGS OF 2016 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2016
  • [28] Conversational Speech Transcription Using Context-Dependent Deep Neural Networks
    Seide, Frank
    Li, Gang
    Yu, Dong
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011: 444+
  • [29] Speech Recognition Using Deep Neural Networks: A Systematic Review
    Nassif, Ali Bou
    Shahin, Ismail
    Attili, Imtinan
    Azzeh, Mohammad
    Shaalan, Khaled
    IEEE ACCESS, 2019, 7: 19143-19165
  • [30] Enhancing analysis of diadochokinetic speech using deep neural networks
    Segal-Feldman, Yael
    Hitczenko, Kasia
    Goldrick, Matthew
    Buchwald, Adam
    Roberts, Angela
    Keshet, Joseph
    COMPUTER SPEECH AND LANGUAGE, 2025, 90