Domain Adaptation Speech-to-Text for Low-Resource European Portuguese Using Deep Learning

Cited by: 2
Authors
Medeiros, Eduardo [1]
Corado, Leonel [1]
Rato, Luis [1,2]
Quaresma, Paulo [1,2]
Salgueiro, Pedro [1,2]
Affiliations
[1] Univ Evora, Escola Ciencias & Tecnol, P-7000671 Evora, Portugal
[2] Univ Evora, Ctr ALGORITMI, Vista Lab, P-7000671 Evora, Portugal
Keywords
machine learning; deep learning; deep neural networks; speech-to-text; automatic speech recognition; NVIDIA NeMo; GPUs; data-centric; Portuguese language; RECOGNITION
DOI
10.3390/fi15050159
CLC classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Automatic speech recognition (ASR), commonly known as speech-to-text, is the process of transcribing audio recordings into text, i.e., transforming speech into the corresponding sequence of words. This paper presents the optimization and evaluation of a deep learning ASR system for European Portuguese. We present a pipeline composed of several stages: data acquisition, analysis, pre-processing, model creation, and evaluation. We propose a transfer learning approach that takes a model optimized for English as the starting point and European Portuguese as the target; the transfer process is additionally supported by a source from a different domain, a multiple-variant Portuguese dataset composed essentially of Brazilian Portuguese. Domain adaptation was investigated between European Portuguese and mixed (mostly Brazilian) Portuguese. The optimization and evaluation used the NVIDIA NeMo framework, implementing the QuartzNet15x5 architecture, which is based on 1D time-channel separable convolutions. Following this data-centric transfer learning approach, the model was optimized, achieving a state-of-the-art word error rate (WER) of 0.0503.
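To make the reported metric concrete, the sketch below computes WER under its usual definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. On that definition, a WER of 0.0503 corresponds to roughly five word errors per hundred reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and any text normalization the authors apply before scoring is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences for illustration only (not from the paper's datasets).
reference = "o modelo foi treinado com dados de português europeu"
hypothesis = "o modelo foi treinado com dados português europeia"
print(f"WER = {word_error_rate(reference, hypothesis):.4f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, so it is an error rate rather than a bounded accuracy score.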
Pages: 16