A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models

Cited by: 0
Authors
Li, Yan [1]
Wang, Yapeng [1]
Hoi, Lap Man [1]
Yang, Dingcheng [3]
Im, Sio-Kei [2]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau, Peoples R China
[2] Macao Polytech Univ, Macau, Peoples R China
[3] Nanchang Univ, Sch Informat Engn, Nanchang, Peoples R China
Source
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2025 / Vol. 2025 / Issue 01
Keywords
Portuguese speech recognition; Review; End-to-end models;
DOI
10.1186/s13636-024-00388-w
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Automatic speech recognition has become an important bridge for human-computer interaction and is widely applied across many fields. Portuguese speech recognition is gradually receiving attention because of the language's unique status. However, relatively scarce data resources have constrained the development and application of Portuguese speech recognition systems, and the neglect of accent differences further hinders their adoption. This study reviews the progress of end-to-end technology on the Portuguese speech recognition task. It discusses relevant research along two directions, Brazilian Portuguese recognition and European Portuguese recognition, and organizes the available corpus resources for prospective researchers. Then, taking European Portuguese speech recognition as an example, it uses Fairseq-S2T and Whisper as benchmarks, tested on a 500-hour European Portuguese dataset, to assess the performance of large-scale pre-trained models and fine-tuning techniques. Whisper obtained a word error rate (WER) of 5.11%, which indicates that multilingual joint training can enhance generalization ability. Finally, in view of the existing problems in Portuguese speech recognition, it explores future research directions, providing new ideas for the next stage of research and system construction.
Pages: 13
Related papers
59 in total
  • [1] LSF and LPC - Derived Features for Large Vocabulary Distributed Continuous Speech Recognition in Brazilian Portuguese
    Alencar, V. F. S.
    Alcaim, A.
    [J]. 2008 42ND ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS AND COMPUTERS, VOLS 1-4, 2008, : 1237 - 1241
  • [2] A Data-Centric Approach for Portuguese Speech Recognition: Language Model And Its Implications
    Alvarenga, Joao Paulo Reis
    Merschmann, Luiz Henrique de Campos
    Luz, Eduardo Jose da Silva
    [J]. IEEE LATIN AMERICA TRANSACTIONS, 2023, 21 (04) : 546 - 556
  • [3] [Anonymous], 2000, The CMU Pronunciation Dictionary
  • [4] Ardila R, 2020, arXiv, DOI 10.48550/arXiv.1912.06670
  • [5] Baevski A, 2020, ADV NEUR IN, V33
  • [6] JOINT UNSUPERVISED AND SUPERVISED TRAINING FOR MULTILINGUAL ASR
    Bai, Junwen
    Li, Bo
    Zhang, Yu
    Bapna, Ankur
    Siddhartha, Nikhil
    Sim, Khe Chai
    Sainath, Tara N.
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6402 - 6406
  • [7] Bansal S, 2019, arXiv:1809.01431
  • [8] Bhable S.G., 2023, Comparative Analysis of Automatic Speech Recognition Techniques, V105, P897, DOI 10.2991/978-94-6463
  • [9] CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
    Candido Junior, Arnaldo
    Casanova, Edresson
    Soares, Anderson
    de Oliveira, Frederico Santos
    Oliveira, Lucas
    Fernandes Junior, Ricardo Corso
    Pinto da Silva, Daniel Peixoto
    Fayet, Fernando Gorgulho
    Carlotto, Bruno Baldissera
    Stefanel Gris, Lucas Rafael
    Aluisio, Sandra Maria
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2023, 57 (03) : 1139 - 1171
  • [10] Carvalho C., 2021, IBERSPEECH 2021 TRIB, P185, DOI 10.21437/IberSPEECH.2021-40