ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Citations: 0
Authors
Xiao, Xiong [1]
Zhao, Shengkui [2]
Jones, Douglas L. [2]
Chng, Eng Siong [1, 3]
Li, Haizhou [1, 3, 4, 5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore
[5] ASTAR, Inst Infocomm Res, Singapore, Singapore
Source
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017
Keywords
beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and are also complementary. Combined, they achieve a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
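The abstract describes a pipeline in which TF masks weight the multichannel observations to produce speech and noise SCMs, which in turn determine the MVDR filter. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes the masks are already available (in the paper they come from an RNN) and uses the reference-channel MVDR form w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s), which is a common choice in mask-based beamforming but is not stated in the abstract. The function names, array layout, and the diagonal-loading constant are all hypothetical.

```python
import numpy as np


def masked_scm(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix for every frequency bin.

    stft: complex array of shape (F, T, M): bins, frames, microphones.
    mask: real array of shape (F, T) with values in [0, 1].
    Returns a complex array of shape (F, M, M).
    """
    weighted = stft * mask[..., None]
    # Sum over frames of mask(f, t) * x(f, t) x(f, t)^H.
    scm = np.einsum("ftm,ftn->fmn", weighted, stft.conj())
    return scm / (mask.sum(axis=1)[:, None, None] + eps)


def mvdr_weights(scm_speech, scm_noise, ref_mic=0, diag_load=1e-6, eps=1e-12):
    """MVDR weights of shape (F, M) in the reference-channel form (assumed)."""
    num_mics = scm_noise.shape[-1]
    # Diagonal loading (illustrative constant) keeps the noise SCM invertible.
    avg_power = np.trace(scm_noise, axis1=-2, axis2=-1).real / num_mics
    loaded = scm_noise + diag_load * avg_power[:, None, None] * np.eye(num_mics)
    # Solve Phi_n Z = Phi_s per bin instead of forming an explicit inverse.
    ratio = np.linalg.solve(loaded, scm_speech)
    trace = np.trace(ratio, axis1=-2, axis2=-1)[:, None]
    return ratio[..., ref_mic] / (trace + eps)


def beamform(stft, weights):
    """Apply y(f, t) = w(f)^H x(f, t); returns shape (F, T)."""
    return np.einsum("fm,ftm->ft", weights.conj(), stft)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F, T, M = 5, 100, 6  # toy sizes; 6 microphones as in the CHiME-4 setup
    x = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))
    speech_mask = rng.uniform(size=(F, T))  # stand-in for an estimated mask
    noise_mask = 1.0 - speech_mask
    w = mvdr_weights(masked_scm(x, speech_mask), masked_scm(x, noise_mask))
    print(beamform(x, w).shape)
```

In this sketch the noise mask is simply the complement of the speech mask. The abstract instead mentions estimating the two masks independently, in which case the two calls to `masked_scm` would receive separately predicted masks.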
Pages: 3246-3250
Page count: 5
Related papers
50 in total
  • [31] Manifold HLDA and its application to robust speech recognition
    Kubo, Toshiaki
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1551 - 1554
  • [32] AN MCMC APPROACH TO JOINT ESTIMATION OF CLEAN SPEECH AND NOISE FOR ROBUST SPEECH RECOGNITION
    Mushtaq, Aleem
    Lee, Chin-Hui
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7107 - 7111
  • [33] Distant speech separation using predicted time-frequency masks from spatial features
    Pertila, Pasi
    Nikunen, Joonas
    SPEECH COMMUNICATION, 2015, 68 : 97 - 106
  • [34] Stereo-input Speech Recognition using Sparseness-based Time-frequency Masking in a Reverberant Environment
    Izumi, Yosuke
    Nishiki, Kenta
    Watanabe, Shinji
    Nishimoto, Takuya
    Ono, Nobutaka
    Sagayama, Shigeki
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1907 - +
  • [35] ROBUST SPEECH RECOGNITION USING BEAMFORMING WITH ADAPTIVE MICROPHONE GAINS AND MULTICHANNEL NOISE REDUCTION
    Zhao, Shengkui
    Xiao, Xiong
    Zhang, Zhaofeng
    Thi Ngoc Tho Nguyen
    Zhong, Xionghu
    Ren, Bo
    Wang, Longbiao
    Jones, Douglas L.
    Chng, Eng Siong
    Li, Haizhou
    2015 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2015, : 460 - 467
  • [36] ON SPATIAL FEATURES FOR SUPERVISED SPEECH SEPARATION AND ITS APPLICATION TO BEAMFORMING AND ROBUST ASR
    Wang, Zhong-Qiu
    Wang, DeLiang
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5709 - 5713
  • [37] Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend
    Hori, Takaaki
    Chen, Zhuo
    Erdogan, Hakan
    Hershey, John R.
    Le Roux, Jonathan
    Mitra, Vikramjit
    Watanabe, Shinji
    COMPUTER SPEECH AND LANGUAGE, 2017, 46 : 401 - 418
  • [38] EXPLORING PRACTICAL ASPECTS OF NEURAL MASK-BASED BEAMFORMING FOR FAR-FIELD SPEECH RECOGNITION
    Boeddeker, Christoph
    Erdogan, Hakan
    Yoshioka, Takuya
    Haeb-Umbach, Reinhold
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6697 - 6701
  • [39] A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition
    Yapanel, Umit H.
    Hansen, John H. L.
    SPEECH COMMUNICATION, 2008, 50 (02) : 142 - 152
  • [40] Review of Time-Frequency Masking Approach for Improving Speech Intelligibility in Noise
    Kim, Gibak
    IETE TECHNICAL REVIEW, 2022, 39 (03) : 623 - 634