A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems

Cited by: 81

Authors:

Lin, Yi [1]

Guo, Dongyue [1]

Zhang, Jianwei [1]

Chen, Zhengmao [1]

Yang, Bo [1]

Affiliations:

[1] Sichuan Univ, Coll Comp Sci, Chengdu 610000, Peoples R China

Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS | 2021 / Vol. 32 / No. 08

Funding:

U.S. National Science Foundation;

Keywords:

Hidden Markov models; Task analysis; Atmospheric modeling; Speech recognition; Vocabulary; Decoding; Real-time systems; Acoustic model (AM); air traffic control (ATC); machine translation pronunciation model (PM); multiscale CNN (MCNN); multilingual; robust speech recognition; DEEP NEURAL-NETWORKS;

DOI:

10.1109/TNNLS.2020.3015830

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This work focuses on robust speech recognition in air traffic control (ATC) by designing a novel processing paradigm to integrate multilingual speech recognition into a single framework using three cascaded modules: an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM converts ATC speech into phoneme-based text sequences that the PM then translates into a word-based sequence, which is the ultimate goal of this research. The LM corrects both phoneme- and word-based errors in the decoding results. The AM, including the convolutional neural network (CNN) and recurrent neural network (RNN), considers the spatial and temporal dependences of the speech features and is trained by the connectionist temporal classification loss. To cope with radio transmission noise and diversity among speakers, a multiscale CNN architecture is proposed to fit the diverse data distributions and improve the performance. Phoneme-to-word translation is addressed via a proposed machine translation PM with an encoder-decoder architecture. RNN-based LMs are trained to consider the code-switching specificity of the ATC speech by building dependences with common words. We validate the proposed approach using large amounts of real Chinese and English ATC recordings and achieve a 3.95% label error rate on Chinese characters and English words, outperforming other popular approaches. The decoding efficiency is also comparable to that of the end-to-end model, and its generalizability is validated on several open corpora, making it suitable for real-time approaches to further support ATC applications, such as ATC prediction and safety checking.
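The abstract reports a 3.95% label error rate on Chinese characters and English words but does not spell out the formula. A minimal sketch, assuming the common definition (edit distance between the predicted and reference label sequences, divided by the total number of reference labels), might look like the following. The function names and the sample transcripts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference label
                curr[j - 1] + 1,     # insertion of a spurious label
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def label_error_rate(refs: Sequence[Sequence[str]],
                     hyps: Sequence[Sequence[str]]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    total_labels = sum(len(r) for r in refs)
    if total_labels == 0:
        raise ValueError("reference set contains no labels")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_labels


# Hypothetical code-switched transcripts: a label is one Chinese character
# or one English word, mirroring the units named in the abstract.
refs = [["国", "航", "climb", "to", "eight", "thousand"], ["四", "川", "descend"]]
hyps = [["国", "航", "climb", "two", "eight", "thousand"], ["四", "川"]]
ler = label_error_rate(refs, hyps)
```

Here one substitution and one deletion over nine reference labels give a rate of 2/9. Pooling errors over the whole test set, rather than averaging per-utterance rates, is a design choice made for this sketch only.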

Pages: 3608 - 3620

Number of pages: 13

50 records in total

[41] PSEUDO-LABELING FOR MASSIVELY MULTILINGUAL SPEECH RECOGNITION
Lugosch, Loren
Likhomanenko, Tatiana
Synnaeve, Gabriel
Collobert, Ronan
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7687 - 7691
[42] IXHEALTH: A Multilingual Platform for Advanced Speech Recognition in Healthcare
Jose Vivancos-Vicente, Pedro
Salvador Castejon-Garrido, Juan
Andres Paredes-Valverde, Mario
del Pilar Salas-Zarate, Maria
Valencia-Garcia, Rafael
TECHNOLOGIES AND INNOVATION, 2016, 658 : 26 - 38
[43] A hybrid CTC plus Attention model based on end-to-end framework for multilingual speech recognition
Liang, Sendong
Yan, Wei Qi
MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (28) : 41295 - 41308
[44] Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence
Leal, Isabel
Gaur, Neeraj
Haghani, Parisa
Farris, Brian
Moreno, Pedro J.
Prasad, Manasa
Ramabhadran, Bhuvana
Zhu, Yun
INTERSPEECH 2021, 2021, : 2556 - 2560
[45] DARTS-ASR: Differentiable Architecture Search for Multilingual Speech Recognition and Adaptation
Chen, Yi-Chen
Hsu, Jui-Yang
Lee, Cheng-Kuang
Lee, Hung-yi
INTERSPEECH 2020, 2020, : 1803 - 1807
[46] A framework for secure speech recognition
Smaragdis, Paris
Shashanka, Madhusudana V. S.
2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 969 - +
[47] A Framework for Speech Recognition Benchmarking
Dernoncourt, Franck
Trung Bui
Chang, Walter
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 169 - 170
[48] A framework for secure speech recognition
Smaragdis, Paris
Shashanka, Madhusudana
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (04) : 1404 - 1413
[49] SpecMark: A Spectral Watermarking Framework for IP Protection of Speech Recognition Systems
Chen, Huili
Darvish, Bita
Koushanfar, Farinaz
INTERSPEECH 2020, 2020, : 2312 - 2316
[50] MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation
Anwar, Mohamed
Shi, Bowen
Goswami, Vedanuj
Hsu, Wei-Ning
Pino, Juan
Wang, Changhan
INTERSPEECH 2023, 2023, : 4064 - 4068