Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited: 0
|
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Giza 12613, Orman, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been utilized to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step training process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations for the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model that uses a Siamese neural network to compare the pronounced phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to that of the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first utilization of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
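The Siamese comparison step described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the hidden size, the phoneme inventory size, the random embedding table, and the cosine-to-score mapping are all assumptions made here for illustration. In E2E-R itself, the pronounced-phoneme representations come from the fine-tuned SSL encoder and the scoring head is learned from annotated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative, not taken from the paper).
HIDDEN_DIM = 768      # typical hidden size of a wav2vec 2.0 / HuBERT base model
NUM_PHONEMES = 40     # assumed phoneme inventory size

# Canonical phoneme embedding table. In the real model this would be
# learned; here it is random, purely for illustration.
canonical_embeddings = rng.standard_normal((NUM_PHONEMES, HIDDEN_DIM))

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_phoneme(pronounced_repr, canonical_id):
    """Siamese-style comparison: the pronounced-phoneme representation is
    compared against the canonical phoneme embedding, and the cosine
    similarity in [-1, 1] is mapped to a pronunciation score in [0, 1]."""
    sim = cosine_similarity(pronounced_repr, canonical_embeddings[canonical_id])
    return 0.5 * (sim + 1.0)

# Example: score one (frame-averaged) phoneme representation. In E2E-R this
# vector would be produced by the SSL encoder fine-tuned on phoneme recognition.
pronounced = rng.standard_normal(HIDDEN_DIM)
score = score_phoneme(pronounced, canonical_id=5)
assert 0.0 <= score <= 1.0
```

A representation identical to its canonical embedding scores 1.0 under this mapping, while an unrelated one lands near 0.5; the actual model learns this mapping rather than fixing it analytically.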
Pages: 112650 - 112663
Number of pages: 14
Related Papers
50 items
  • [21] Fine-Tuning for Bayer Demosaicking Through Periodic-Consistent Self-Supervised Learning
    Liu, Chang
    He, Songze
    Xu, Jiajun
    Li, Jia
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 989 - 993
  • [22] END-TO-END MUSIC REMASTERING SYSTEM USING SELF-SUPERVISED AND ADVERSARIAL TRAINING
    Koo, Junghyun
    Paik, Seungryeol
    Lee, Kyogu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4608 - 4612
  • [23] Investigating Self-supervised Pre-training for End-to-end Speech Translation
    Ha Nguyen
    Bougares, Fethi
    Tomashenko, Natalia
    Esteve, Yannick
    Besacier, Laurent
    INTERSPEECH 2020, 2020, : 1466 - 1470
  • [24] PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching
    Wang, Hengli
    Fan, Rui
    Cai, Peide
    Liu, Ming
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2021, 6 (03) : 4353 - 4360
  • [25] FINE-TUNING STRATEGIES FOR FASTER INFERENCE USING SPEECH SELF-SUPERVISED MODELS: A COMPARATIVE STUDY
    Zaiem, Salah
    Algayres, Robin
    Parcollet, Titouan
    Essid, Slim
    Ravanelli, Mirco
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [26] Self-supervised Fine-tuning for Efficient Passage Re-ranking
    Kim, Meoungjun
    Ko, Youngjoong
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3142 - 3146
  • [27] Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition
    Yang, Wei
    Fukayama, Satoru
    Heracleous, Panikos
    Ogata, Jun
    INTERSPEECH 2022, 2022, : 1998 - 2002
  • [28] End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models
    Younesian, Taraneh
    Hong, Chi
    Ghiassi, Amirmasoud
    Birke, Robert
    Chen, Lydia Y.
    2020 IEEE SECOND INTERNATIONAL CONFERENCE ON COGNITIVE MACHINE INTELLIGENCE (COGMI 2020), 2020, : 17 - 26
  • [29] Learning end-to-end patient representations through self-supervised covariate balancing for causal treatment effect estimation
    Tesei, Gino
    Giampanis, Stefanos
    Shi, Jingpu
    Norgeot, Beau
    JOURNAL OF BIOMEDICAL INFORMATICS, 2023, 140
  • [30] FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition
    Chen, Szu-Jui
    Xie, Jiamin
    Hansen, John H. L.
    INTERSPEECH 2022, 2022, : 3058 - 3062