Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited: 0
|
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Giza 12613, Orman, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Classification Code
0812
Abstract
Automatic pronunciation assessment models are regularly used in language learning applications. Common methodologies for pronunciation assessment use feature-based approaches, such as the Goodness-of-Pronunciation (GOP) approach, or deep learning speech recognition models to perform speech assessment. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been utilized to extract contextual speech representations, showing improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained using a two-step training process. In the first step, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations for the pronounced phonemes. In the second step, transfer learning is used to build a pronunciation scoring model that uses a Siamese neural network to compare the pronounced phoneme representations to embeddings of the canonical phonemes and produce the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to that of the state-of-the-art GOPT-PAII model, while eliminating the need for training on additional native speech data, feature engineering, or external forced alignment modules. To our knowledge, this work presents the first utilization of a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
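The Siamese comparison step described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the hidden size, the phoneme inventory size, the random embedding table, and the cosine-to-score mapping are all assumptions made here for illustration. In E2E-R itself, the pronounced-phoneme representations come from the fine-tuned SSL encoder and the scoring head is learned from annotated data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative, not taken from the paper).
HIDDEN_DIM = 768      # typical hidden size of a wav2vec 2.0 / HuBERT base model
NUM_PHONEMES = 40     # assumed phoneme inventory size

# Canonical phoneme embedding table. In the real model this would be
# learned; here it is random, purely for illustration.
canonical_embeddings = rng.standard_normal((NUM_PHONEMES, HIDDEN_DIM))

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_phoneme(pronounced_repr, canonical_id):
    """Siamese-style comparison: the pronounced-phoneme representation is
    compared against the canonical phoneme embedding, and the cosine
    similarity in [-1, 1] is mapped to a pronunciation score in [0, 1]."""
    sim = cosine_similarity(pronounced_repr, canonical_embeddings[canonical_id])
    return 0.5 * (sim + 1.0)

# Example: score one (frame-averaged) phoneme representation. In E2E-R this
# vector would be produced by the SSL encoder fine-tuned on phoneme recognition.
pronounced = rng.standard_normal(HIDDEN_DIM)
score = score_phoneme(pronounced, canonical_id=5)
assert 0.0 <= score <= 1.0
```

A representation identical to its canonical embedding scores 1.0 under this mapping, while an unrelated one lands near 0.5; the actual model learns this mapping rather than fixing it analytically.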
Pages: 112650 - 112663
Number of pages: 14
Related Papers
50 items
  • [21] Fine-Tuning for Bayer Demosaicking Through Periodic-Consistent Self-Supervised Learning
    Liu, Chang
    He, Songze
    Xu, Jiajun
    Li, Jia
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 989 - 993
  • [22] END-TO-END MUSIC REMASTERING SYSTEM USING SELF-SUPERVISED AND ADVERSARIAL TRAINING
    Koo, Junghyun
    Paik, Seungryeol
    Lee, Kyogu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4608 - 4612
  • [23] Investigating Self-supervised Pre-training for End-to-end Speech Translation
    Ha Nguyen
    Bougares, Fethi
    Tomashenko, Natalia
    Esteve, Yannick
    Besacier, Laurent
    INTERSPEECH 2020, 2020, : 1466 - 1470
  • [24] PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching
    Wang, Hengli
    Fan, Rui
    Cai, Peide
    Liu, Ming
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2021, 6 (03) : 4353 - 4360
  • [25] FINE-TUNING STRATEGIES FOR FASTER INFERENCE USING SPEECH SELF-SUPERVISED MODELS: A COMPARATIVE STUDY
    Zaiem, Salah
    Algayres, Robin
    Parcollet, Titouan
    Essid, Slim
    Ravanelli, Mirco
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [26] Self-supervised Fine-tuning for Efficient Passage Re-ranking
    Kim, Meoungjun
    Ko, Youngjoong
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3142 - 3146
  • [27] Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition
    Yang, Wei
    Fukayama, Satoru
    Heracleous, Panikos
    Ogata, Jun
    INTERSPEECH 2022, 2022, : 1998 - 2002
  • [28] End-to-End Learning from Noisy Crowd to Supervised Machine Learning Models
    Younesian, Taraneh
    Hong, Chi
    Ghiassi, Amirmasoud
    Birke, Robert
    Chen, Lydia Y.
    2020 IEEE SECOND INTERNATIONAL CONFERENCE ON COGNITIVE MACHINE INTELLIGENCE (COGMI 2020), 2020, : 17 - 26
  • [29] Learning end-to-end patient representations through self-supervised covariate balancing for causal treatment effect estimation
    Tesei, Gino
    Giampanis, Stefanos
    Shi, Jingpu
    Norgeot, Beau
    JOURNAL OF BIOMEDICAL INFORMATICS, 2023, 140
  • [30] FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition
    Chen, Szu-Jui
    Xie, Jiamin
    Hansen, John H. L.
    INTERSPEECH 2022, 2022, : 3058 - 3062