A review on speech recognition approaches and challenges for Portuguese: exploring the feasibility of fine-tuning large-scale end-to-end models

Times cited: 0
Authors
Li, Yan [1]
Wang, Yapeng [1]
Hoi, Lap Man [1]
Yang, Dingcheng [3]
Im, Sio-Kei [2]
Affiliations
[1] Macao Polytech Univ, Fac Appl Sci, Macau, Peoples R China
[2] Macao Polytech Univ, Macau, Peoples R China
[3] Nanchang Univ, Sch Informat Engn, Nanchang, Peoples R China
Source
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING | 2025 / Vol. 2025 / Issue 01
Keywords
Portuguese speech recognition; Review; End-to-end models;
DOI
10.1186/s13636-024-00388-w
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Automatic speech recognition (ASR) has become an important bridge for human-computer interaction and is widely applied across many fields. Portuguese speech recognition is gradually receiving attention because of the language's distinctive position. However, relatively scarce data resources have constrained the development and application of Portuguese speech recognition systems, and the neglect of accent variation further hinders their adoption. This study reviews the progress of end-to-end techniques on the Portuguese speech recognition task. It discusses relevant research in two directions, Brazilian Portuguese recognition and European Portuguese recognition, and organizes the available corpus resources for prospective researchers. Then, taking European Portuguese speech recognition as an example, it benchmarks Fairseq-S2T and Whisper on a 500-hour European Portuguese dataset to estimate the performance of large-scale pre-trained models and fine-tuning techniques. Whisper obtained a word error rate (WER) of 5.11%, which indicates that multilingual joint training can enhance generalization ability. Finally, addressing the existing problems in Portuguese speech recognition, it explores future research directions, providing new ideas for the next stage of research and system construction.
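For readers unfamiliar with the WER figure quoted in the abstract, the metric is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. The following is a minimal sketch of that conventional definition. It is not the evaluation code used in the paper: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the paper's text normalization is not described in this record.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a WER of 5.11% corresponds to roughly five word errors per hundred reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.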
Pages: 13