A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems

Cited by: 82

Authors:

Lin, Yi [1]

Guo, Dongyue [1]

Zhang, Jianwei [1]

Chen, Zhengmao [1]

Yang, Bo [1]

Affiliations:

[1] Sichuan Univ, Coll Comp Sci, Chengdu 610000, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2021 / Vol. 32 / No. 08

Funding:

U.S. National Science Foundation;

Keywords:

Hidden Markov models; Task analysis; Atmospheric modeling; Speech recognition; Vocabulary; Decoding; Real-time systems; Acoustic model (AM); air traffic control (ATC); machine translation pronunciation model (PM); multiscale CNN (MCNN); multilingual; robust speech recognition; DEEP NEURAL-NETWORKS;

DOI:

10.1109/TNNLS.2020.3015830

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This work focuses on robust speech recognition in air traffic control (ATC) by designing a novel processing paradigm to integrate multilingual speech recognition into a single framework using three cascaded modules: an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM converts ATC speech into phoneme-based text sequences that the PM then translates into a word-based sequence, which is the ultimate goal of this research. The LM corrects both phoneme- and word-based errors in the decoding results. The AM, including the convolutional neural network (CNN) and recurrent neural network (RNN), considers the spatial and temporal dependences of the speech features and is trained by the connectionist temporal classification loss. To cope with radio transmission noise and diversity among speakers, a multiscale CNN architecture is proposed to fit the diverse data distributions and improve the performance. Phoneme-to-word translation is addressed via a proposed machine translation PM with an encoder-decoder architecture. RNN-based LMs are trained to consider the code-switching specificity of the ATC speech by building dependences with common words. We validate the proposed approach using large amounts of real Chinese and English ATC recordings and achieve a 3.95% label error rate on Chinese characters and English words, outperforming other popular approaches. The decoding efficiency is also comparable to that of the end-to-end model, and its generalizability is validated on several open corpora, making it suitable for real-time approaches to further support ATC applications, such as ATC prediction and safety checking.
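As a rough illustration of two ideas named in the abstract, the sketch below shows how frame-level output from a CTC-trained acoustic model is commonly collapsed into a label sequence, and how a label error rate can be computed. It is not the authors' implementation: the blank index, the function names, and the definition of label error rate as total edit distance divided by total reference length are all assumptions made for this example.

```python
from typing import List, Sequence

BLANK = 0  # hypothetical index of the CTC blank label (an assumption)


def ctc_greedy_collapse(frame_labels: Sequence[int], blank: int = BLANK) -> List[int]:
    """Merge consecutive repeated labels, then drop blanks (standard CTC collapse)."""
    collapsed: List[int] = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        row = [i]
        for j, h in enumerate(hyp, start=1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (r != h)))  # substitution or match
        prev_row = row
    return prev_row[-1]


def label_error_rate(refs: Sequence[Sequence], hyps: Sequence[Sequence]) -> float:
    """Assumed definition: summed edit distance over summed reference length."""
    total_length = sum(len(r) for r in refs)
    if total_length == 0:
        raise ValueError("reference set is empty")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_length
```

For mixed Chinese and English output, the labels passed to `label_error_rate` would be Chinese characters and English words, matching the units the abstract reports.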


Pages: 3608-3620

Number of pages: 13

