An efficient joint training model for monaural noisy-reverberant speech recognition

Cited by: 0

Authors:

Lian, Xiaoyu [1]

Xia, Nan [1]

Dai, Gaole [1]

Yang, Hongqin [1]

Affiliation:

[1] Dalian Polytech Univ, Sch Informat Sci & Engn, Dalian 116034, Liaoning, Peoples R China

Source:

APPLIED ACOUSTICS | 2025 / Vol. 228

Keywords:

Deep learning; Speech enhancement; Speech recognition; Attention mechanism; Joint training; NETWORKS; ENHANCEMENT; FRAMEWORK; SIGNAL;

DOI:

10.1016/j.apacoust.2024.110322

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Noise and reverberation can seriously degrade speech quality and intelligibility, which in turn harms the performance of downstream speech recognition tasks. This paper constructs a jointly trained network for speech recognition in monaural noisy-reverberant environments. In the speech enhancement model, a complex-valued channel and temporal-frequency attention (CCTFA) is integrated to focus on the key features of the complex spectrum, and the CCTFA network (CCTFANet) is then constructed to reduce the influence of noise and reverberation. In the speech recognition model, an element-wise linear attention (EWLA) module is proposed to linearize the attention complexity and reduce the number of parameters and computations required for the attention module. The EWLA Conformer (EWLAC) is then constructed as an efficient end-to-end speech recognition model. On an open-source dataset, joint training of CCTFANet with EWLAC reduces the character error rate (CER) by 3.27%. Compared with other speech recognition models, EWLAC maintains CER while achieving a much lower parameter count and computational overhead and a higher inference speed.

Pages: 13
