LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Cited by: 0
Authors
Kim, Haechan [1, 2]
Myung, Junho [2]
Kim, Seoyoung [2]
Lee, Sungpah [1]
Kang, Dongyeop [3]
Kim, Juho [1, 2]
Affiliations
[1] Ringle, Seoul, South Korea
[2] Korea Adv Inst Sci & Technol, Sch Comp, Daejeon, South Korea
[3] Univ Minnesota, Minneapolis, MN USA
Source
INTERSPEECH 2024 | 2024
Keywords
speech recognition; non-native spontaneous speech; English as a second/foreign language; RECOGNITION; ERROR;
DOI
10.21437/Interspeech.2024-2392
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.
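The WER figures in the abstract can be made concrete with a minimal sketch. It assumes WER is the word-level edit distance divided by the number of reference words, and that "44.2% lower" is a relative reduction; the abstract states neither explicitly, nor does it give the baseline WER. The function names and the example sentences are hypothetical and are not taken from LearnerVoice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def implied_baseline(wer_after: float, relative_reduction: float) -> float:
    """Baseline WER implied by a WER and a relative reduction (assumption)."""
    return wer_after / (1.0 - relative_reduction)


# Hypothetical example: one substitution and one deletion over six words.
example_wer = word_error_rate(
    "i went to the store yesterday",
    "i go to store yesterday",
)

# Under the relative-reduction reading, 10.26% after fine-tuning implies
# a vanilla WER of 10.26 / (1 - 0.442), roughly 18.4%.
vanilla_wer = implied_baseline(10.26, 0.442)
```

If the 44.2% were instead an absolute difference in percentage points, the implied baseline would be 54.46%, so the reading matters when comparing against other reported results.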
Pages: 2325-2329
Page count: 5