Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

Cited by: 97
Authors
Li, Kun [1]
Qian, Xiaojun [1]
Meng, Helen [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
Keywords
Deep neural networks; L2 English speech; mispronunciation detection; mispronunciation diagnosis; speech recognition; PRONUNCIATION ERROR PATTERNS; UNSUPERVISED DISCOVERY; MODELS; REPRESENTATIONS; RECOGNITION; AGREEMENT
DOI
10.1109/TASLP.2016.2621675
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper investigates the use of multidistribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on extended recognition networks (ERNs). ERNs leverage existing automatic speech recognition technology by constraining the search space to the canonical transcriptions of the target words plus their likely phonetic error patterns. MDD is then achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has two issues: 1) learning the error patterns of the target words to generate the ERNs remains a challenging task, and phones or phone errors missing from the ERNs cannot be recognized even with well-trained acoustic models; and 2) acoustic models and phonological rules are trained independently, and hence contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features include acoustic features as well as the corresponding graphemes and canonical transcriptions (encoded as binary vectors). The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework that works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER) for MDD are 4.6%, 30.5%, and 13.5%, respectively. This outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.
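The abstract reports FRR, FAR, and DER without defining them. The sketch below shows one way these rates might be computed once recognized phones are compared with canonical ones. It assumes the hierarchical definitions commonly used in MDD work (FRR = FR / (TA + FR), FAR = FA / (FA + TR), DER = DE / (CD + DE)), which are not stated in this record, and it assumes the three phone sequences are already aligned position by position. The function names `mdd_counts` and `mdd_rates` and the toy phone labels are hypothetical.

```python
from collections import Counter


def mdd_counts(canonical, annotated, recognized):
    """Count MDD outcomes over three pre-aligned phone sequences.

    canonical:  phones the learner was expected to say
    annotated:  phones a human annotator judged were actually said
    recognized: phones output by the recognizer
    Assumption: the sequences have equal length and are aligned 1:1
    (insertions and deletions are not handled in this sketch).
    """
    if not (len(canonical) == len(annotated) == len(recognized)):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter()
    for canon, spoken, hyp in zip(canonical, annotated, recognized):
        pronounced_correctly = spoken == canon
        accepted = hyp == canon  # recognizer agrees with the canonical phone
        if pronounced_correctly and accepted:
            counts["TA"] += 1  # true acceptance
        elif pronounced_correctly:
            counts["FR"] += 1  # false rejection
        elif accepted:
            counts["FA"] += 1  # false acceptance
        else:
            counts["TR"] += 1  # true rejection
            # Diagnosis is correct only if the recognizer names the
            # phone that was actually produced.
            counts["CD" if hyp == spoken else "DE"] += 1
    return counts


def _rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def mdd_rates(counts):
    """Derive FRR, FAR, and DER from the outcome counts (assumed definitions)."""
    return {
        "FRR": _rate(counts["FR"], counts["TA"] + counts["FR"]),
        "FAR": _rate(counts["FA"], counts["FA"] + counts["TR"]),
        "DER": _rate(counts["DE"], counts["CD"] + counts["DE"]),
    }


if __name__ == "__main__":
    # Toy, made-up example; not data from the paper.
    canonical = ["n", "ao", "r", "th", "w", "ih"]
    annotated = ["n", "ao", "l", "s", "w", "iy"]
    recognized = ["n", "aa", "l", "th", "w", "eh"]
    print(mdd_rates(mdd_counts(canonical, annotated, recognized)))
```

Under these definitions, DER is computed only over true rejections, so it measures how often a correctly detected mispronunciation is attributed to the wrong phone.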
Pages: 193-207
Number of pages: 15
Related papers
50 in total
  • [1] Mispronunciation Detection and Diagnosis in L2 English Speech Using Multi-Distribution Deep Neural Networks
    Li, Kun
    Meng, Helen
    2014 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2014, : 255 - 259
  • [2] UNSUPERVISED DISCOVERY OF AN EXTENDED PHONEME SET IN L2 ENGLISH SPEECH FOR MISPRONUNCIATION DETECTION AND DIAGNOSIS
    Mao, Shaoguang
    Li, Xu
    Li, Kun
    Wu, Zhiyong
    Liu, Xunying
    Meng, Helen
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6244 - 6248
  • [3] Lexical Stress Detection for L2 English Speech Using Deep Belief Networks
    Li, Kun
    Qian, Xiaojun
    Kang, Shiyin
    Meng, Helen
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1810 - 1814
  • [4] Mispronunciation detection and diagnosis using deep neural networks: a systematic review
    Lounis, Meriem
    Dendani, Bilal
    Bahi, Halima
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (23) : 62793 - 62827
  • [5] Unsupervised Discovery of Non-native Phonetic Patterns in L2 English Speech for Mispronunciation Detection and Diagnosis
    Li, Xu
    Mao, Shaoguang
    Wu, Xixin
    Li, Kun
    Liu, Xunying
    Meng, Helen
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2554 - 2558
  • [6] INTEGRATING ARTICULATORY FEATURES INTO ACOUSTIC-PHONEMIC MODEL FOR MISPRONUNCIATION DETECTION AND DIAGNOSIS IN L2 ENGLISH SPEECH
    Mao, Shaoguang
    Wu, Zhiyong
    Li, Xu
    Li, Runnan
    Wu, Xixin
    Meng, Helen
    2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018
  • [7] APPLYING MULTITASK LEARNING TO ACOUSTIC-PHONEMIC MODEL FOR MISPRONUNCIATION DETECTION AND DIAGNOSIS IN L2 ENGLISH SPEECH
    Mao, Shaoguang
    Wu, Zhiyong
    Li, Runnan
    Li, Xu
    Meng, Helen
    Cai, Lianhong
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6254 - 6258
  • [8] Intonation classification for L2 English speech using multi-distribution deep neural networks
    Li, Kun
    Wu, Xixin
    Meng, Helen
    COMPUTER SPEECH AND LANGUAGE, 2017, 43 : 18 - 33
  • [9] L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis
    Zhang, Daniel Yue
    Ganesan, Ashwinkumar
    Campbell, Sarah
    Korzekwa, Daniel
    INTERSPEECH 2022, 2022, : 4317 - 4321
  • [10] An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
    Chen, Qi
    Lin, Binghuai
    Xie, Yanlu
    INTERSPEECH 2022, 2022, : 4342 - 4346