End-to-End Mispronunciation Detection with Simulated Error Distance

Cited by: 3
Authors
Zhang, Zhan [1]
Wang, Yuehai [1]
Yang, Jianyi [1]
Affiliations
[1] Zhejiang Univ, Dept Informat & Elect Engn, Hangzhou, Zhejiang, Peoples R China
Source
INTERSPEECH 2022 | 2022
Keywords
mispronunciation detection; second language learning; speech recognition; TRANSFORMER; SPEECH;
DOI
10.21437/Interspeech.2022-870
Chinese Library Classification number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
With the development of deep learning, the performance of mispronunciation detection models has improved greatly. However, annotating mispronunciations is expensive, because experts must carefully judge the error in each pronounced phoneme. As a result, supervised end-to-end mispronunciation detection models face a shortage of training data. Although text-based data augmentation can partially alleviate this problem, our analysis shows that it simulates only categorical phoneme errors. Such a simulation represents real mispronunciations poorly. In this paper, we propose a novel unit-based data augmentation method. Our method converts the continuous audio signal into a robust audio vector and then into a discrete unit sequence. By modifying this unit sequence, we generate more plausible mispronunciations and obtain the vector distance as an error indicator. Experiments on L2Arctic show that training on such simulated data improves mispronunciation detection performance compared with the text-based method.
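The unit-based augmentation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which encoder, quantizer, or modification rule the paper uses, so the codebook, the frame vectors, the nearest-neighbor quantization, the random substitution rule, and every function name below are hypothetical stand-ins chosen only to show the idea of "modify a discrete unit and keep the vector distance as the error indicator".

```python
import math
import random

def nearest_unit(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))

def quantize(frames, codebook):
    """Map each continuous frame vector to a discrete unit index."""
    return [nearest_unit(f, codebook) for f in frames]

def simulate_errors(units, codebook, error_rate, rng):
    """Randomly substitute units (assumed rule, not taken from the paper).

    Returns the modified unit sequence and, per position, the distance
    between the original and substituted codebook vectors. Unmodified
    positions get a distance of 0.0.
    """
    modified, distances = [], []
    for u in units:
        if rng.random() < error_rate:
            new_u = rng.choice([k for k in range(len(codebook)) if k != u])
            modified.append(new_u)
            distances.append(math.dist(codebook[u], codebook[new_u]))
        else:
            modified.append(u)
            distances.append(0.0)
    return modified, distances

# Hypothetical 2-D codebook and frame vectors standing in for encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.05], [0.9, 0.1], [0.95, 0.9], [0.05, 1.1]]

units = quantize(frames, codebook)
modified, distances = simulate_errors(units, codebook, error_rate=0.5,
                                      rng=random.Random(0))
```

In this sketch the distance is a graded label: a substitution between nearby units yields a smaller value than one between distant units, in contrast to a purely categorical correct/incorrect tag.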
Pages: 4327-4331
Page count: 5