Improving speech recognition models with small samples for air traffic control systems

Cited by: 36
Authors
Lin, Yi [1 ]
Li, Qin [2 ]
Yang, Bo [1 ]
Yan, Zhen [1 ]
Tan, Huachun [3 ]
Chen, Zhengmao [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610064, Peoples R China
[2] Beijing Inst Technol, Sch Mech Engn, Beijing 100081, Peoples R China
[3] Southeast Univ, Sch Transportat, Nanjing 211189, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Air traffic control system; Automatic speech recognition; Deep learning; Pretraining; Small training samples; Transfer learning;
DOI
10.1016/j.neucom.2020.08.092
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the domain of air traffic control (ATC) systems, efforts to train a practical automatic speech recognition (ASR) model always face the problem of small training samples, since the collection and annotation of speech samples are expert- and domain-dependent tasks. In this work, a novel training approach based on pretraining and transfer learning is proposed to address this issue, and an improved end-to-end deep learning model is developed to address the specific challenges of ASR in the ATC domain. An unsupervised pretraining strategy is first proposed to learn speech representations from the unlabeled samples of a given dataset. Specifically, a masking strategy is applied to improve the diversity of the samples without losing their general patterns. Subsequently, transfer learning is applied to fine-tune the pretrained model or other optimized baseline models to finally achieve the supervised ASR task. By virtue of the common terminology used in the ATC domain, the transfer learning task can be regarded as a sub-domain adaptation task, in which the transferred model is optimized using a joint corpus consisting of baseline samples and newly transcribed samples from the target dataset. This joint corpus construction strategy enriches the size and diversity of the training samples, which is important for addressing the issue of the small transcribed corpus. In addition, speed perturbation is applied to augment the newly transcribed samples and further improve the quality of the speech corpus. Three real ATC datasets are used to validate the proposed ASR model and training strategies. The experimental results demonstrate that the ASR performance is significantly improved on all three datasets, with an absolute character error rate only one-third of that achieved through supervised training alone. The applicability of the proposed strategies to other ASR approaches is also validated. (c) 2021 Elsevier B.V. All rights reserved.
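The masking strategy is not detailed in this record; the sketch below shows one plausible reading, assuming wav2vec-style span masking over spectral features, where contiguous frame spans are zeroed so the augmented sample stays globally intact. The function name mask_spans, the span length, and the masking probability are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def mask_spans(feats, mask_prob=0.15, span=10, seed=0):
        # feats: (num_frames, num_bins) spectral features, e.g. 80-dim fbanks.
        # Illustrative assumption: each contiguous span of `span` frames is
        # independently zeroed with probability `mask_prob`, keeping the
        # utterance's overall temporal structure.
        rng = np.random.default_rng(seed)
        out = feats.copy()
        for start in range(0, out.shape[0], span):
            if rng.random() < mask_prob:
                out[start:start + span] = 0.0  # zero out the whole span
        return out

    # Example: a 500-frame utterance of 80-dim filterbank features.
    feats = np.random.randn(500, 80).astype(np.float32)
    masked = mask_spans(feats)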
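Speed perturbation, mentioned as the augmentation for the newly transcribed samples, is a standard ASR recipe in which each waveform is resampled at factors such as 0.9x and 1.1x to yield extra training copies. A minimal sketch, assuming linear interpolation as a stand-in for a production resampler (e.g. sox or torchaudio):

    import numpy as np

    def speed_perturb(wave, factor):
        # Resample a 1-D waveform so it plays `factor` times faster;
        # linear interpolation stands in for a proper resampler here.
        new_len = int(round(len(wave) / factor))
        new_idx = np.linspace(0, len(wave) - 1, new_len)
        return np.interp(new_idx, np.arange(len(wave)), wave)

    # Two extra copies per utterance, the usual 0.9x / 1.1x recipe.
    wave = np.random.randn(16000)  # one second at 16 kHz, dummy audio
    augmented = [speed_perturb(wave, f) for f in (0.9, 1.1)]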
Pages: 287-297
Number of pages: 11