Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited by: 0
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Orman, Giza 12613, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
CLC classification number
TP [Automation technology, computer technology]
Discipline classification number
0812
Abstract
Automatic pronunciation assessment models are widely used in language learning applications. Common methodologies for pronunciation assessment rely on feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or on deep learning speech recognition models. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been used to extract contextual speech representations, yielding improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained in two steps. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model with a Siamese neural network that compares the pronounced phoneme representations against embeddings of the canonical phonemes and produces the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, nearly on par with the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first use of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
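The abstract's second step, comparing a pronounced-phoneme representation against a canonical phoneme embedding with a Siamese network, can be illustrated with a minimal sketch. The paper's actual scoring head is not specified here, so this toy version assumes a cosine-similarity comparison mapped to a [0, 1] score; the function names and the similarity-to-score mapping are illustrative, not the authors' implementation.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length (eps guards against division by zero)."""
    return v / (np.linalg.norm(v) + eps)

def siamese_pronunciation_score(pronounced_repr, canonical_emb):
    """Toy Siamese-style comparison (illustrative, not the E2E-R head):
    cosine similarity between the pronounced-phoneme representation and
    the canonical phoneme embedding, mapped from [-1, 1] to a [0, 1]
    pronunciation score."""
    cos = float(np.dot(l2_normalize(pronounced_repr),
                       l2_normalize(canonical_emb)))
    return (cos + 1.0) / 2.0

# A representation identical to the canonical embedding scores ~1.0.
rng = np.random.default_rng(0)
v = rng.normal(size=16)
print(round(siamese_pronunciation_score(v, v), 3))  # → 1.0
```

In the real model, both branches would be learned: one branch encodes the audio via the fine-tuned SSL model, the other embeds the canonical phoneme label, and the comparison is trained to regress human-annotated phoneme-level scores.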
Pages: 112650-112663
Page count: 14