ON TIME-FREQUENCY MASK ESTIMATION FOR MVDR BEAMFORMING WITH APPLICATION IN ROBUST SPEECH RECOGNITION

Cited by: 0

Authors:

Xiao, Xiong ^{[1]}

Zhao, Shengkui ^{[2]}

Jones, Douglas L. ^{[2]}

Chng, Eng Siong ^{[1,3]}

Li, Haizhou ^{[1,3,4,5]}

Affiliations:

[1] Nanyang Technol Univ, Temasek Labs, Singapore, Singapore

[2] Adv Digital Sci Ctr, Singapore, Singapore

[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[4] Natl Univ Singapore, Dept ECE, Singapore, Singapore

[5] ASTAR, Inst Infocomm Res, Singapore, Singapore

Source:

2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2017

Keywords:

beamforming; robust speech recognition; time-frequency mask; neural networks; long short-term memory;

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and are also complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
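The link between TF masks, SCMs, and MVDR weights described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state which MVDR formulation the authors use, so the reference-channel formulation below (weights proportional to the noise-SCM inverse times the speech SCM, normalized by its trace) is an assumption, as are the diagonal loading and the function names `masked_scm`, `mvdr_weights`, and `apply_beamformer`, which are hypothetical. The masks are taken as given; the RNN that estimates them is not shown.

```python
import numpy as np


def masked_scm(stft, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrix per frequency bin.

    stft: complex array, shape (channels, frames, freqs)
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    weighted = stft * mask[None, :, :]
    # Sum over frames of mask * x x^H for each frequency bin.
    scm = np.einsum("ctf,dtf->fcd", weighted, stft.conj())
    norm = mask.sum(axis=0) + eps  # shape (freqs,)
    return scm / norm[:, None, None]


def mvdr_weights(scm_speech, scm_noise, ref_channel=0, diag_load=1e-6):
    """MVDR weights in the reference-channel form (an assumed formulation).

    w(f) = (Phi_n^{-1} Phi_s) u / trace(Phi_n^{-1} Phi_s),
    where u selects the reference channel.
    returns: complex array, shape (freqs, channels)
    """
    n_ch = scm_noise.shape[-1]
    # Diagonal loading (an assumption) keeps the noise SCM well conditioned.
    load = diag_load * np.trace(scm_noise, axis1=-2, axis2=-1).real / n_ch
    scm_noise = scm_noise + load[:, None, None] * np.eye(n_ch)
    numerator = np.linalg.solve(scm_noise, scm_speech)  # Phi_n^{-1} Phi_s
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_channel] / (trace[:, None] + 1e-12)


def apply_beamformer(weights, stft):
    """Beamformed STFT y(t, f) = w(f)^H x(t, f), shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

Given a multichannel STFT `stft` and estimated masks `speech_mask` and `noise_mask`, the enhanced signal would be `apply_beamformer(mvdr_weights(masked_scm(stft, speech_mask), masked_scm(stft, noise_mask)), stft)`. More accurate masks give SCMs that are less contaminated by the other source, which is the dependence the abstract highlights.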

Citation:

Pages: 3246-3250

Page count: 5
