Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited by: 0
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Giza 12613, Orman, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Automatic pronunciation assessment models are widely used in language-learning applications. Common methodologies rely on feature-based approaches, such as Goodness of Pronunciation (GOP), or on deep learning speech recognition models. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been used to extract contextual speech representations, yielding improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained in two steps. First, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. Second, transfer learning is used to build a pronunciation scoring model in which a Siamese neural network compares the pronounced-phoneme representations to embeddings of the canonical phonemes and produces the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to the state-of-the-art GOPT-PAII model, while eliminating the need for additional native speech data, feature engineering, or external forced-alignment modules. To our knowledge, this work is the first to use a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
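The Siamese comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration of the idea only, not the authors' E2E-R implementation: the projection, the canonical-embedding table, the dimensions, and the mapping from cosine similarity to a [0, 1] score are all hypothetical choices, and the parameters here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: phoneme inventory size, SSL feature size, shared space size.
NUM_PHONEMES, SSL_DIM, PROJ_DIM = 40, 768, 128

# Stand-ins for learned parameters (random here, for illustration only).
canonical_emb = rng.normal(size=(NUM_PHONEMES, PROJ_DIM))       # canonical phoneme embeddings
proj = rng.normal(size=(SSL_DIM, PROJ_DIM)) / np.sqrt(SSL_DIM)  # shared projection

def score_phoneme(ssl_repr: np.ndarray, canonical_id: int) -> float:
    """Compare a pronounced-phoneme SSL representation to the embedding of
    the canonical phoneme and map cosine similarity to a [0, 1] score."""
    pronounced = ssl_repr @ proj                 # project SSL features into the shared space
    canonical = canonical_emb[canonical_id]      # look up the expected phoneme's embedding
    cos = pronounced @ canonical / (
        np.linalg.norm(pronounced) * np.linalg.norm(canonical)
    )
    return float((cos + 1.0) / 2.0)              # rescale [-1, 1] -> [0, 1]

# One pronounced phoneme, scored against canonical phoneme index 3.
print(score_phoneme(rng.normal(size=SSL_DIM), canonical_id=3))
```

In the actual two-step training, the SSL encoder producing `ssl_repr` would first be fine-tuned for phoneme recognition, and the comparison network would then be trained to regress human pronunciation scores.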
Pages: 112650-112663 (14 pages)