An efficient joint training model for monaural noisy-reverberant speech recognition

被引:0
作者
Lian, Xiaoyu [1 ]
Xia, Nan [1 ]
Dai, Gaole [1 ]
Yang, Hongqin [1 ]
机构
[1] Dalian Polytech Univ, Sch Informat Sci & Engn, Dalian 116034, Liaoning, Peoples R China
关键词
Deep learning; Speech enhancement; Speech recognition; Attention mechanism; Joint training; NETWORKS; ENHANCEMENT; FRAMEWORK; SIGNAL;
D O I
10.1016/j.apacoust.2024.110322
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
Noise and reverberation can seriously reduce speech quality and intelligibility, affecting the performance of downstream speech recognition tasks. This paper constructs a joint training speech recognition network for speech recognition in monaural noisy-reverberant environments. In the speech enhancement model, a complex-valued channel and temporal-frequency attention (CCTFA) are integrated to focus on the key features of the complex spectrum. Then the CCTFA network (CCTFANet) is constructed to reduce the influence of noise and reverberation. In the speech recognition model, an element-wise linear attention (EWLA) module is proposed to linearize the attention complexity and reduce the number of parameters and computations required for the attention module. Then the EWLA Conformer (EWLAC) is constructed as an efficient end-to-end speech recognition model. On the open source dataset, joint training of CCTFANet with EWLAC reduces the CER by 3.27%. Compared to other speech recognition models, EWLAC maintains CER while achieving much lower parameter count, computational overhead, and higher inference speed.
引用
收藏
页数:13
相关论文
共 50 条
  • [41] Model Compensation Approach Based on Nonuniform Spectral Compression Features for Noisy Speech Recognition
    Geng-Xin Ning
    Gang Wei
    Kam-Keung Chu
    EURASIP Journal on Advances in Signal Processing, 2007
  • [42] Speech recognition for noisy conditions based on discrete wavelet transform and parallel model combination
    Hu, CH
    Liu, XF
    ICEMI 2005: Conference Proceedings of the Seventh International Conference on Electronic Measurement & Instruments, Vol 1, 2005, : 408 - 411
  • [43] Model compensation approach based on nonuniform spectral compression features for noisy speech recognition
    Ning, Geng-Xin
    Wei, Gang
    Chu, Kam-Keung
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2007, 2007 (1)
  • [44] Train from scratch: Single-stage joint training of speech separation and recognition
    Shi, Jing
    Chang, Xuankai
    Watanabe, Shinji
    Xu, Bo
    COMPUTER SPEECH AND LANGUAGE, 2022, 76
  • [45] Improving Speech Recognition with Augmented Synthesized Data and Conditional Model Training
    Xue, Shaofei
    Tang, Jian
    Liu, Yazhu
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 443 - 447
  • [46] Efficient Self-Attention Model for Speech Recognition-Based Assistive Robots Control
    Poirier, Samuel
    Cote-Allard, Ulysse
    Routhier, Francois
    Campeau-Lecours, Alexandre
    SENSORS, 2023, 23 (13)
  • [47] REAL-TIME SPEECH ENHANCEMENT IN NOISY REVERBERANT MULTI-TALKER ENVIRONMENTS BASED ON A LOCATION-INDEPENDENT ROOM ACOUSTICS MODEL
    Nakatani, Tomohiro
    Yoshioka, Takuya
    Kinoshita, Keisuke
    Miyoshi, Masato
    Juang, Biing-Hwang
    2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1- 8, PROCEEDINGS, 2009, : 137 - 140
  • [48] Efficient Language Model Adaptation for Automatic Speech Recognition of Spoken Translations
    Pelemans, Joris
    Vanallemeersch, Tom
    Demuynck, Kris
    Van Hamme, Hugo
    Wambacq, Patrick
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2262 - 2266
  • [49] JOINT SEPARATION AND DENOISING OF NOISY MULTI-TALKER SPEECH USING RECURRENT NEURAL NETWORKS AND PERMUTATION INVARIANT TRAINING
    Kolbaek, Morten
    Yu, Dong
    Tan, Zheng-Hua
    Jensen, Jesper
    2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [50] Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition
    Chen, Xie
    Liu, Xunying
    Wang, Yongqiang
    Gales, Mark J. F.
    Woodland, Philip C.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (11) : 2146 - 2157