An iterative mask estimation approach to deep learning based multi-channel speech recognition

Cited by: 19
Authors
Tu, Yan-Hui [1 ]
Du, Jun [1 ]
Sun, Lei [1 ]
Ma, Feng [2 ]
Wang, Hai-Kun [2 ]
Chen, Jing-Dong [3 ]
Lee, Chin-Hui [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] iFlytek Co Ltd, Hefei, Anhui, Peoples R China
[3] Northwestern Polytech Univ, Xian, Shaanxi, Peoples R China
[4] Georgia Inst Technol, Atlanta, GA 30332 USA
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
CHiME challenge; Deep learning; Ideal ratio mask (IRM); Microphone array; Robust speech recognition; BLIND SOURCE SEPARATION; ENHANCEMENT; NOISE;
DOI
10.1016/j.specom.2018.11.005
Chinese Library Classification
O42 [Acoustics];
Subject Classification Code
070206 ; 082403 ;
Abstract
We propose a novel iterative mask estimation (IME) framework to improve the state-of-the-art complex Gaussian mixture model (CGMM)-based beamforming approach in an iterative manner by leveraging the complementary information obtained from different deep models. Although CGMM has recently been demonstrated to be quite effective for multi-channel automatic speech recognition (ASR) in operational scenarios, the corresponding mask estimation is not always accurate in adverse environments due to the lack of prior or context information. To address this problem, a neural-network-based ideal ratio mask (IRM) estimator learned from a multi-condition data set is first adopted to incorporate prior information, obtained from speech/noise interactions and long acoustic context, into the CGMM-based beamformed speech, which has a higher signal-to-noise ratio (SNR) than the original noisy speech signal. Next, to further exploit the rich context information in deep acoustic and language models, voice activity detection (VAD) information derived from the speech recognition results is used to refine the mask estimate, yielding a significant reduction in insertion errors. On the recently launched CHiME-4 Challenge ASR task of recognizing 6-channel microphone array speech, the proposed IME approach significantly and consistently outperforms the CGMM approach under different configurations, with relative word error rate reductions ranging from 20% to 30%. Furthermore, the IME approach plays a key role in the ensemble system that achieves the best performance in the CHiME-4 Challenge.
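The abstract describes a pipeline that alternates mask-based beamforming with network- and VAD-guided mask refinement. The following is a minimal, hedged sketch of that loop, not the authors' implementation: it assumes a precomputed CGMM speech mask, a hypothetical IRM-estimating callable `irm_net`, and a per-frame VAD vector `vad` (all inputs in STFT domain), and uses a standard mask-based MVDR beamformer as the enhancement step.

```python
import numpy as np

def mask_to_psd(obs, mask):
    """Estimate a spatial covariance (PSD) matrix from a time-frequency mask.
    obs:  (F, T, C) complex multi-channel STFT
    mask: (F, T) mask in [0, 1] for the masked source (speech or noise)
    """
    outer = np.einsum('ftc,ftd->ftcd', obs, obs.conj())          # per-bin outer products
    weighted = mask[..., None, None] * outer
    return weighted.sum(axis=1) / np.maximum(mask.sum(axis=1)[:, None, None], 1e-8)

def mvdr_weights(psd_speech, psd_noise, ref_channel=0):
    """Per-frequency MVDR beamformer weights, shape (F, C)."""
    F, C, _ = psd_speech.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        num = np.linalg.solve(psd_noise[f] + 1e-6 * np.eye(C), psd_speech[f])
        w[f] = (num / np.maximum(np.trace(num).real, 1e-8))[:, ref_channel]
    return w

def iterative_mask_estimation(obs, cgmm_mask, irm_net, vad, n_iter=2):
    """Hedged sketch of the IME loop (assumed interfaces, not the paper's code):
    start from the CGMM speech mask, beamform, re-estimate an IRM on the
    enhanced (higher-SNR) signal with a network, and gate it with VAD."""
    mask = cgmm_mask                                   # (F, T) initial speech mask
    for _ in range(n_iter):
        psd_s = mask_to_psd(obs, mask)                 # speech spatial covariance
        psd_n = mask_to_psd(obs, 1.0 - mask)           # noise spatial covariance
        w = mvdr_weights(psd_s, psd_n)
        enhanced = np.einsum('fc,ftc->ft', w.conj(), obs)   # beamformed STFT (F, T)
        irm = irm_net(np.abs(enhanced))                # assumed callable: (F, T) IRM
        mask = irm * vad[None, :]                      # zero non-speech frames (assumed (T,) VAD)
    return mask, enhanced
```

The VAD gating mirrors the role the abstract attributes to recognition-derived VAD information: frames decoded as silence contribute no speech energy to the refined mask, which is what reduces insertion errors.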
Pages: 31 - 43
Number of pages: 13
Related Papers
50 records in total
  • [1] A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-channel Speech Recognition in the CHiME-6 Challenge
    Tu, Yan-Hui
    Du, Jun
    Sun, Lei
    Ma, Feng
    Pan, Jia
    Lee, Chin-Hui
    INTERSPEECH 2020, 2020, : 96 - 100
  • [2] An information fusion framework with multi-channel feature concatenation and multi-perspective system combination for the deep-learning-based robust recognition of microphone array speech
    Tu, Yan-Hui
    Du, Jun
    Wang, Qing
    Bao, Xiao
    Dai, Li-Rong
    Lee, Chin-Hui
    COMPUTER SPEECH AND LANGUAGE, 2017, 46 : 517 - 534
  • [3] On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones
    Tu, Yan-Hui
    Du, Jun
    Sun, Lei
    Ma, Feng
    Lee, Chin-Hui
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 394 - 398
  • [4] MULTI-CHANNEL AUTOMATIC SPEECH RECOGNITION USING DEEP COMPLEX UNET
    Kong, Yuxiang
    Wu, Jian
    Wang, Quandong
    Gao, Peng
    Zhuang, Weiji
    Wang, Yujun
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 104 - 110
  • [5] Acoustic Model Combination Incorporated With Mask-Based Multi-Channel Source Separation for Automatic Speech Recognition
    Yoon, Jae Sam
    Park, Ji Hun
    Kim, Hong Kook
    Kim, Hoirin
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (05) : 772 - 784
  • [6] A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation
    Zhang, Hao
    Wang, DeLiang
    INTERSPEECH 2021, 2021, : 1139 - 1143
  • [7] DMANET: DEEP LEARNING-BASED DIFFERENTIAL MICROPHONE ARRAYS FOR MULTI-CHANNEL SPEECH SEPARATION
    Yang, Xiaokang
    Wei, Jianguo
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4363 - 4367
  • [8] Deep MCANC: A deep learning approach to multi-channel active noise control
    Zhang, Hao
    Wang, DeLiang
    NEURAL NETWORKS, 2023, 158 : 318 - 327
  • [9] Speech distortion weighted multi-channel Wiener filter and its application to speech recognition
    Kim, Gibak
    IEICE ELECTRONICS EXPRESS, 2015, 12 (06): : 1 - 7
  • [10] Multi-channel sub-band speech recognition
    McCowan, I. A.
    Sridharan, S.
    EURASIP Journal on Advances in Signal Processing, 2001 (1) : 45 - 52