End-to-End Mispronunciation Detection with Simulated Error Distance

Cited by: 3

Authors:

Zhang, Zhan [1]

Wang, Yuehai [1]

Yang, Jianyi [1]

Affiliations:

[1] Zhejiang Univ, Dept Informat & Elect Engn, Hangzhou, Zhejiang, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Keywords:

mispronunciation detection; second language learning; speech recognition; TRANSFORMER; SPEECH;

DOI:

10.21437/Interspeech.2022-870

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Deep learning has greatly improved the performance of mispronunciation detection models. However, annotating mispronunciations is expensive because experts must carefully judge the error in each pronounced phoneme. As a result, supervised end-to-end mispronunciation detection models suffer from a shortage of training data. Text-based data augmentation can partially alleviate this problem, but our analysis shows that it only simulates categorical phoneme errors, and such a simulation is a poor match for real mispronunciations. In this paper, we propose a novel unit-based data augmentation method. Our method converts the continuous audio signal into robust audio vectors and then into a discrete unit sequence. By modifying this unit sequence, we generate more plausible mispronunciations and obtain the vector distance as an error indicator. Experiments on L2Arctic show that training on such simulated data improves mispronunciation detection performance compared with the text-based method.
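
The unit-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how vectors are extracted, how units are chosen, or how the sequence is modified. The sketch assumes frame vectors are already available, uses a nearest-centroid codebook as the unit inventory, and substitutes units uniformly at random. The names `quantize`, `simulate_errors`, and `error_rate` are hypothetical.

```python
import numpy as np


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    # Pairwise squared Euclidean distances, shape (num_frames, num_units).
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def simulate_errors(units, codebook, error_rate, rng):
    """Randomly substitute units; score each position by centroid distance.

    Assumption: a substitution stands in for a simulated mispronunciation,
    and the distance between the original and substituted centroids serves
    as a continuous error indicator (0 where nothing was changed).
    """
    units = np.asarray(units)
    num_units = codebook.shape[0]
    modified = units.copy()
    # Choose which positions to corrupt.
    mask = rng.random(units.shape[0]) < error_rate
    # A non-zero offset modulo num_units guarantees a different unit.
    offsets = rng.integers(1, num_units, size=units.shape[0])
    modified[mask] = (units[mask] + offsets[mask]) % num_units
    distance = np.linalg.norm(codebook[units] - codebook[modified], axis=1)
    return modified, distance


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(8, 4))   # toy unit inventory
    frames = rng.normal(size=(6, 4))     # toy frame vectors
    units = quantize(frames, codebook)
    modified, distance = simulate_errors(units, codebook, 0.5, rng)
    print(units, modified, distance.round(3))
```

Unlike a categorical correct/incorrect label, the distance here grows with how far the substituted unit lies from the original one in the vector space, which is the property the abstract contrasts with text-based augmentation.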


Pages: 4327-4331

Page count: 5
