Speech Recognition for Air Traffic Control via Feature Learning and End-to-End Training

Cited by: 3
Authors
Fan, Peng [1 ]
Hua, Xiyao [2 ]
Lin, Yi [2 ]
Yang, Bo [2 ]
Zhang, Jianwei [2 ]
Ge, Wenyi [3 ]
Guo, Dongyue [1 ]
Affiliations
[1] Sichuan Univ, Natl Key Lab Fundamental Sci Synthet Vis, Chengdu, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China
[3] Chengdu Univ Informat Technol, Coll Comp Sci, Chengdu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
automatic speech recognition; feature learning; air traffic control; multilingual; end-to-end training
DOI
10.1587/transinf.2022EDP7151
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
In this work, we propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems. The proposed model integrates a feature learning block, a recurrent neural network (RNN), and the connectionist temporal classification (CTC) loss to build an end-to-end ASR model. To cope with the complex acoustic environment of ATC speech, a learning block is designed to extract informative features directly from raw waveforms for acoustic modeling, instead of relying on hand-crafted features. Both SincNet and 1D convolution blocks are applied to process the raw waveforms, and their outputs are concatenated and fed to RNN layers for temporal modeling. Thanks to its ability to learn representations from raw waveforms, the proposed model can be optimized in a completely end-to-end manner, i.e., from waveform to text. Finally, the multilingual issue in the ATC domain is also addressed by constructing a combined vocabulary of Chinese characters and English letters. The proposed approach is validated on a multilingual real-world corpus (ATCSpeech), and the experimental results demonstrate that it outperforms other baselines, achieving a 6.9% character error rate.
Pages: 538-544
Number of pages: 7
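The abstract describes a waveform-to-text pipeline: two parallel front ends over raw audio (a SincNet block and a 1D convolution block), concatenation of their outputs, RNN layers for temporal modeling, and CTC training over a combined Chinese-character / English-letter vocabulary. The following sketch is a minimal illustration of that pipeline, not the authors' implementation: the SincNet branch is approximated here by an ordinary Conv1d, and all layer sizes, the vocabulary size, and the toy training step are assumptions made for demonstration only.

# Minimal sketch (PyTorch) of the pipeline described in the abstract.
# The SincNet-style branch is stood in for by a plain Conv1d; a real
# SincNet layer would instead learn parameterized band-pass filters.
import torch
import torch.nn as nn


class WaveformEncoder(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        # SincNet-style branch (approximated by an ordinary Conv1d here).
        self.sinc_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=251, stride=160, padding=125),
            nn.ReLU(),
        )
        # Plain 1D convolution branch over the same raw waveform.
        self.conv_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, stride=160, padding=5),
            nn.ReLU(),
        )
        # Temporal modeling over the concatenated frame-level features.
        self.rnn = nn.GRU(128, hidden, num_layers=3,
                          batch_first=True, bidirectional=True)
        # Frame-wise projection to the combined vocabulary (+1 for CTC blank).
        self.proj = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples), raw audio in [-1, 1]
        x = wav.unsqueeze(1)                              # (B, 1, T)
        feats = torch.cat([self.sinc_branch(x),
                           self.conv_branch(x)], dim=1)   # (B, 128, T')
        out, _ = self.rnn(feats.transpose(1, 2))          # (B, T', 2*hidden)
        return self.proj(out).log_softmax(dim=-1)         # CTC log-probabilities


# Toy CTC training step; vocabulary size and utterance lengths are made up.
vocab_size = 4000                     # e.g. Chinese characters + English letters
model = WaveformEncoder(vocab_size)
ctc = nn.CTCLoss(blank=vocab_size)    # reserve the last index for the blank token
wav = torch.randn(2, 16000)           # two 1-second utterances at 16 kHz
log_probs = model(wav)                # (B, T', vocab_size + 1)
targets = torch.randint(0, vocab_size, (2, 12))
input_lens = torch.full((2,), log_probs.size(1), dtype=torch.long)
target_lens = torch.full((2,), 12, dtype=torch.long)
loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()

Because both branches use the same stride, their frame sequences align and can be concatenated along the channel dimension before the RNN; treating Chinese characters and English letters as a single output vocabulary lets one CTC head handle the multilingual ATC transcripts end to end.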