ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Times cited: 0
Authors
Xiao, Xiong [1]
Zhao, Shengkui [2]
Jones, Douglas L. [2]
Chng, Eng Siong [1, 3]
Li, Haizhou [1, 3, 4, 5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore
[5] A*STAR, Inst Infocomm Res, Singapore, Singapore
Source
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017
Keywords
beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory
DOI
Not available
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated both individually and in combination on the CHiME-4 challenge. The results show that each proposed method improves ASR performance on its own and that the methods are complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
Pages: 3246-3250
Number of pages: 5
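The abstract's central idea is that TF masks drive the speech and noise SCM estimates, which in turn determine the MVDR filter. A minimal sketch for a single frequency bin makes this concrete. It assumes the speech and noise masks are already given: the paper estimates them with RNNs, which are not reproduced here, and the binary masks below are hypothetical stand-ins. It also assumes one common reference-channel MVDR formulation, w = (Phi_N^{-1} Phi_S) u_ref / trace(Phi_N^{-1} Phi_S). The abstract does not state which MVDR variant the paper uses. The function names `masked_scm`, `solve`, `mvdr_weights`, and `beamform` are hypothetical, and plain Python complex arithmetic is used so the sketch has no external dependencies.

```python
"""Mask-based MVDR beamforming for one frequency bin (illustrative sketch only)."""

from typing import List, Sequence

Matrix = List[List[complex]]


def masked_scm(frames: Sequence[Sequence[complex]], mask: Sequence[float]) -> Matrix:
    """Mask-weighted spatial covariance matrix: sum_t m_t y_t y_t^H / sum_t m_t."""
    m = len(frames[0])
    total = sum(mask)
    if total <= 0:
        raise ValueError("mask must have positive total weight")
    phi = [[0j] * m for _ in range(m)]
    for y, weight in zip(frames, mask):
        for i in range(m):
            for j in range(m):
                phi[i][j] += weight * y[i] * y[j].conjugate()
    return [[v / total for v in row] for row in phi]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [A | B]; rows are copied so the inputs stay unchanged.
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mvdr_weights(phi_s: Matrix, phi_n: Matrix, ref: int = 0) -> List[complex]:
    """Assumed formulation: w = (Phi_N^{-1} Phi_S) u_ref / trace(Phi_N^{-1} Phi_S)."""
    g = solve(phi_n, phi_s)  # Phi_N^{-1} Phi_S without forming the inverse
    tr = sum(g[i][i] for i in range(len(g)))
    return [g[i][ref] / tr for i in range(len(g))]


def beamform(frames: Sequence[Sequence[complex]], w: Sequence[complex]) -> List[complex]:
    """Beamformer output x_t = w^H y_t for each frame."""
    return [sum(wi.conjugate() * yi for wi, yi in zip(w, y)) for y in frames]


if __name__ == "__main__":
    d = [1 + 0j, 0.5 + 0.5j]  # hypothetical relative transfer function, mic 0 = reference
    speech = [[s * di for di in d] for s in (2 + 0j, -1j)]
    noise = [[1 + 0j, 0.2j], [0.3 + 0j, 1 + 0j]]
    frames = noise + speech
    speech_mask = [0.0, 0.0, 1.0, 1.0]  # stand-ins for RNN-estimated masks
    noise_mask = [1.0, 1.0, 0.0, 0.0]
    w = mvdr_weights(masked_scm(frames, speech_mask), masked_scm(frames, noise_mask))
    # Distortionless constraint: speech frames come out as their reference-mic
    # samples (2 and -1j), up to floating-point rounding.
    print(beamform(speech, w))
```

In this toy case the speech mask selects only speech frames, so the speech SCM is rank one and the beamformer passes the reference-channel speech unchanged. With imperfect masks, noise leaks into the speech SCM (and vice versa), which is why the abstract stresses that mask quality governs MVDR performance.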