ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Authors
Xiao, Xiong [1]
Zhao, Shengkui [2]
Jones, Douglas L. [2]
Chng, Eng Siong [1,3]
Li, Haizhou [1,3,4,5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore
[2] Adv Digital Sci Ctr, Singapore, Singapore
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore
[5] ASTAR, Inst Infocomm Res, Singapore, Singapore
Keywords
beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and are also complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
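The abstract does not give formulas, so the following is only a minimal sketch of how TF masks are commonly used to estimate SCMs and derive MVDR weights for a single frequency bin. It assumes the reference-channel formulation w = (Phi_N^-1 Phi_S) u / trace(Phi_N^-1 Phi_S), which may differ from the authors' implementation. The function names, the diagonal loading value, and the synthetic data with oracle 0/1 masks are hypothetical and stand in for the RNN-estimated masks described above.

```python
import random


def outer(y):
    """Return the rank-1 matrix y y^H for a complex vector y."""
    return [[a * b.conjugate() for b in y] for a in y]


def masked_scm(frames, mask):
    """Mask-weighted SCM for one frequency bin: sum_t m_t y_t y_t^H / sum_t m_t."""
    n = len(frames[0])
    scm = [[0j] * n for _ in range(n)]
    for y, m in zip(frames, mask):
        yy = outer(y)
        for i in range(n):
            for j in range(n):
                scm[i][j] += m * yy[i][j]
    total = sum(mask)
    return [[v / total for v in row] for row in scm]


def solve(a, b):
    """Solve A X = B (complex) by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mvdr_weights(scm_speech, scm_noise, ref=0, loading=1e-6):
    """MVDR weights w = (Phi_N^-1 Phi_S) u_ref / trace(Phi_N^-1 Phi_S).

    The small diagonal loading is an assumed safeguard against an
    ill-conditioned noise SCM, not a value taken from the paper.
    """
    n = len(scm_noise)
    loaded = [[scm_noise[i][j] + (loading if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    g = solve(loaded, scm_speech)  # Phi_N^-1 Phi_S
    trace = sum(g[i][i] for i in range(n))
    return [g[i][ref] / trace for i in range(n)]


def beamform(w, y):
    """Apply the beamformer to one multichannel observation: w^H y."""
    return sum(wi.conjugate() * yi for wi, yi in zip(w, y))


# Synthetic single-bin example (all values are illustrative).
rng = random.Random(0)


def cgauss():
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))


d = [1 + 0j, 0.5 - 0.5j, -0.3 + 0.8j]  # assumed steering vector, 3 microphones
speech = []
for _ in range(20):
    s = cgauss()  # one speech coefficient per frame, shared by all microphones
    speech.append([s * di for di in d])
noise = [[cgauss() for _ in d] for _ in range(20)]

frames = speech + noise
speech_mask = [1.0] * 20 + [0.0] * 20  # oracle masks replace RNN estimates here
noise_mask = [0.0] * 20 + [1.0] * 20

phi_s = masked_scm(frames, speech_mask)
phi_n = masked_scm(frames, noise_mask)
w = mvdr_weights(phi_s, phi_n)

response = beamform(w, d)  # response to the speech direction
out_noise = sum(abs(beamform(w, y)) ** 2 for y in noise) / len(noise)
ref_noise = sum(abs(y[0]) ** 2 for y in noise) / len(noise)
print(abs(response), out_noise, ref_noise)
```

In this toy setting the beamformer passes the speech component at the reference microphone undistorted while its output noise power does not exceed that of the reference microphone. With estimated rather than oracle masks, the quality of the SCMs, and therefore of the beamformer, depends directly on mask accuracy, which is the motivation stated in the abstract.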
Pages: 3246-3250
Page count: 5