Fine-Tuning Self-Supervised Learning Models for End-to-End Pronunciation Scoring

Cited by: 0
Authors
Zahran, Ahmed I. [1 ]
Fahmy, Aly A. [1 ]
Wassif, Khaled T. [1 ]
Bayomi, Hanaa [1 ]
Affiliations
[1] Cairo Univ, Fac Comp & Artificial Intelligence, Giza 12613, Orman, Egypt
Keywords
Automatic pronunciation assessment; pronunciation scoring; pre-trained speech representations; self-supervised speech representation learning; wav2vec 2.0; WavLM; HuBERT
DOI
10.1109/ACCESS.2023.3317236
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Automatic pronunciation assessment models are widely used in language-learning applications. Common methodologies rely on feature-based approaches, such as Goodness of Pronunciation (GOP), or on deep learning speech recognition models. With the rise of transformers, pre-trained self-supervised learning (SSL) models have been used to extract contextual speech representations, yielding improvements in various downstream tasks. In this study, we propose the end-to-end regressor (E2E-R) model for pronunciation scoring. E2E-R is trained in two steps. First, the pre-trained SSL model is fine-tuned on a phoneme recognition task to obtain better representations of the pronounced phonemes. Second, transfer learning is used to build a pronunciation scoring model in which a Siamese neural network compares the pronounced-phoneme representations to embeddings of the canonical phonemes and produces the final pronunciation scores. E2E-R achieves a Pearson correlation coefficient (PCC) of 0.68, comparable to the state-of-the-art GOPT-PAII model, while eliminating the need for additional native speech data, feature engineering, or external forced-alignment modules. To our knowledge, this work is the first to use a pre-trained SSL model for end-to-end phoneme-level pronunciation scoring on raw speech waveforms. The code is available at https://github.com/ai-zahran/E2E-R.
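The Siamese comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration of the idea only, not the authors' E2E-R implementation: the projection, the canonical-embedding table, the dimensions, and the mapping from cosine similarity to a [0, 1] score are all hypothetical choices, and the parameters here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: phoneme inventory size, SSL feature size, shared space size.
NUM_PHONEMES, SSL_DIM, PROJ_DIM = 40, 768, 128

# Stand-ins for learned parameters (random here, for illustration only).
canonical_emb = rng.normal(size=(NUM_PHONEMES, PROJ_DIM))       # canonical phoneme embeddings
proj = rng.normal(size=(SSL_DIM, PROJ_DIM)) / np.sqrt(SSL_DIM)  # shared projection

def score_phoneme(ssl_repr: np.ndarray, canonical_id: int) -> float:
    """Compare a pronounced-phoneme SSL representation to the embedding of
    the canonical phoneme and map cosine similarity to a [0, 1] score."""
    pronounced = ssl_repr @ proj                 # project SSL features into the shared space
    canonical = canonical_emb[canonical_id]      # look up the expected phoneme's embedding
    cos = pronounced @ canonical / (
        np.linalg.norm(pronounced) * np.linalg.norm(canonical)
    )
    return float((cos + 1.0) / 2.0)              # rescale [-1, 1] -> [0, 1]

# One pronounced phoneme, scored against canonical phoneme index 3.
print(score_phoneme(rng.normal(size=SSL_DIM), canonical_id=3))
```

In the actual two-step training, the SSL encoder producing `ssl_repr` would first be fine-tuned for phoneme recognition, and the comparison network would then be trained to regress human pronunciation scores.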
Pages: 112650-112663 (14 pages)