ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Cited by: 0
Authors
Xiao, Xiong [1]
Zhao, Shengkui [2]
Jones, Douglas L. [2]
Chng, Eng Siong [1, 3]
Li, Haizhou [1, 3, 4, 5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore
[5] ASTAR, Inst Infocomm Res, Singapore, Singapore
Source
2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017
Keywords
beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Acoustic beamforming plays a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that each proposed method improves ASR performance on its own and that the methods are complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, compared with 12.0% for the state-of-the-art MVDR implementation.
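The link between TF masks, SCMs, and MVDR weights described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function names `masked_scm` and `mvdr_weights`, the diagonal loading value, and the choice of the reference-channel MVDR form w = (Φ_n⁻¹ Φ_s / trace(Φ_n⁻¹ Φ_s)) u are assumptions made for illustration, and the authors' actual formulation may differ. The masks are taken as given here; in the paper they would come from the RNN.

```python
import numpy as np


def masked_scm(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix for one frequency bin.

    Y:    (channels, frames) complex STFT coefficients at one frequency.
    mask: (frames,) real-valued weights in [0, 1] (speech or noise mask).
    Hypothetical helper, not the paper's code.
    """
    weighted = Y * mask  # broadcast the per-frame mask over channels
    return (weighted @ Y.conj().T) / (mask.sum() + eps)


def mvdr_weights(phi_s, phi_n, ref=0, diag_load=1e-6):
    """MVDR weights in the reference-channel form (an assumed formulation).

    phi_s, phi_n: (channels, channels) speech and noise SCMs.
    ref:          index of the reference microphone.
    """
    C = phi_n.shape[0]
    # Light diagonal loading keeps the noise SCM well conditioned.
    phi_n = phi_n + diag_load * np.trace(phi_n).real / C * np.eye(C)
    num = np.linalg.solve(phi_n, phi_s)  # Phi_n^{-1} Phi_s
    return num[:, ref] / np.trace(num)


def apply_beamformer(w, Y):
    """Beamformed output w^H Y for one frequency bin, shape (frames,)."""
    return w.conj() @ Y
```

In use, `masked_scm` would be called twice per frequency bin, once with the speech mask and once with the noise mask, and the two resulting matrices passed to `mvdr_weights`. Better masks give SCMs closer to the true speech and noise statistics, which is the dependency the abstract emphasizes.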
Pages: 3246-3250
Number of pages: 5
Related papers
50 records in total
  • [21] Multi-Channel Bin-Wise Speech Separation Combining Time-Frequency Masking and Beamforming
    Bella, Mostafa
    Saylani, Hicham
    Hosseini, Shahram
    Deville, Yannick
    IEEE ACCESS, 2023, 11 : 100632 - 100645
  • [22] A Beamforming Algorithm Based on Maximum Likelihood of a Complex Gaussian Distribution With Time-Varying Variances for Robust Speech Recognition
    Cho, Byung Joon
    Lee, Jun-Min
    Park, Hyung-Min
    IEEE SIGNAL PROCESSING LETTERS, 2019, 26 (09) : 1398 - 1402
  • [23] A Combined Time-Frequency Domain Beamforming Method for OFDM Systems
    Seydnejad, Saeid
    Akhzari, Sadegh
    2010 INTERNATIONAL ITG WORKSHOP ON SMART ANTENNAS (WSA 2010), 2010, : 292 - 299
  • [24] Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition
    Shimada, Kazuki
    Bando, Yoshiaki
    Mimura, Masato
    Itoyama, Katsutoshi
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (05) : 960 - 971
  • [25] A time-frequency smoothing neural network for speech enhancement
    Yuan, Wenhao
    SPEECH COMMUNICATION, 2020, 124 : 75 - 84
  • [26] Sequential estimation with optimal forgetting for robust speech recognition
    Afify, M
    Siohan, O
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2004, 12 (01) : 19 - 26
  • [27] Time and frequency filtering of filter-bank energies for robust HMM speech recognition
    Nadeu, C
    Macho, D
    Hernando, J
    SPEECH COMMUNICATION, 2001, 34 (1-2) : 93 - 114
  • [28] A Deep Neural Network Approach for Missing-Data Mask Estimation on Dual-Microphone Smartphones: Application to Noise-Robust Speech Recognition
    Lopez-Espejo, Ivan
    Gonzalez, Jose A.
    Gomez, Angel M.
    Peinado, Antonio M.
    ADVANCES IN SPEECH AND LANGUAGE TECHNOLOGIES FOR IBERIAN LANGUAGES, IBERSPEECH 2014, 2014, 8854 : 119 - 128
  • [29] A deep neural network approach for missing-data mask estimation on dual-microphone smartphones: Application to noise-robust speech recognition
    López-Espejo, I.
    González, José A.
    Gómez, Ángel M.
    Peinado, Antonio M.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, 8854 : 119 - 128
  • [30] DNN-BASED MASK ESTIMATION INTEGRATING SPECTRAL AND SPATIAL FEATURES FOR ROBUST BEAMFORMING
    Deng, Chengyun
    Song, Hui
    Zhang, Yi
    Sha, Yongtao
    Li, Xiangang
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4647 - 4651