Online MVDR Beamformer Based on Complex Gaussian Mixture Model With Spatial Prior for Noise Robust ASR

Cited: 101
Authors
Higuchi, Takuya [1 ]
Ito, Nobutaka [1 ]
Araki, Shoko [1 ]
Yoshioka, Takuya [1 ]
Delcroix, Marc [1 ]
Nakatani, Tomohiro [1 ]
Affiliation
[1] NTT Corp, NTT Commun Sci Lab, Kyoto 6190237, Japan
Keywords
Beamforming; speech enhancement; speech recognition; time-frequency masking; CONVOLUTIONAL NEURAL-NETWORKS; BLIND SOURCE SEPARATION; PERMUTATION PROBLEM;
DOI
10.1109/TASLP.2017.2665341
Chinese Library Classification
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
This paper considers acoustic beamforming for noise robust automatic speech recognition. A beamformer attenuates background noise by enhancing sound components coming from a direction specified by a steering vector. Hence, accurate steering vector estimation is paramount for successful noise reduction. Recently, time-frequency masking has been proposed to estimate the steering vectors used for a beamformer. In particular, we have developed a new form of this approach, which uses a speech spectral model based on a complex Gaussian mixture model (CGMM) to estimate the time-frequency masks needed for steering vector estimation, and extended the CGMM-based beamformer to an online speech enhancement scenario. Our previous experiments showed that the proposed CGMM-based approach outperforms a recently proposed mask estimator based on a Watson mixture model and the baseline speech enhancement system of the CHiME-3 challenge. This paper provides additional experimental results for our online processing, which achieves performance comparable to that of batch processing with a suitable block-batch size. This online version reduces the CHiME-3 word error rate (WER) on the evaluation set from 8.37% to 8.06%. Moreover, in this paper, we introduce a probabilistic prior distribution for a spatial correlation matrix (a CGMM parameter), which enables more stable steering vector estimation in the presence of interfering speakers. In practice, the performance of the proposed online beamformer degrades on observations that contain only noise and/or interference, because the CGMM parameter estimation fails. The introduced spatial prior prevents the target speaker's parameters from overfitting to noise and/or interference. Experimental results show that the spatial prior reduces the WER from 38.4% to 29.2% in a conversation recognition task compared with the CGMM-based approach without the prior, and outperforms a conventional online speech enhancement approach.
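As a rough illustration of the MVDR formulation summarized in the abstract, the sketch below computes mask-weighted spatial covariance matrices (SCMs) and the standard MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d) in NumPy, taking the steering vector d as the principal eigenvector of the speech SCM. This is a minimal sketch, not the paper's method: the CGMM-based mask estimation and spatial prior are omitted, the masks are simply given, and all function names and the toy data are hypothetical.

```python
# Minimal MVDR beamformer sketch (one frequency bin), assuming
# time-frequency masks are already available. Hypothetical illustration;
# the paper estimates these masks with a CGMM, which is omitted here.
import numpy as np

def estimate_scm(stft, mask):
    """Mask-weighted spatial covariance matrix, shape (M, M).
    stft: (M, T) complex STFT at one frequency bin, M mics, T frames.
    mask: (T,) real mask weights in [0, 1]."""
    weighted = stft * mask  # broadcast mask over channels
    return (weighted @ stft.conj().T) / np.maximum(mask.sum(), 1e-10)

def mvdr_weights(R_speech, R_noise):
    """MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d), with the steering
    vector d taken as the principal eigenvector of the speech SCM."""
    _, eigvecs = np.linalg.eigh(R_speech)   # ascending eigenvalues
    d = eigvecs[:, -1]                      # principal eigenvector
    Rn_inv_d = np.linalg.solve(R_noise, d)
    return Rn_inv_d / (d.conj() @ Rn_inv_d)

# Toy usage on random data: 4 mics, 100 frames at one frequency bin.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100)) + 1j * rng.standard_normal((4, 100))
speech_mask = rng.uniform(size=100)
noise_mask = 1.0 - speech_mask
# Small diagonal loading keeps the SCMs well conditioned.
w = mvdr_weights(estimate_scm(X, speech_mask) + 1e-6 * np.eye(4),
                 estimate_scm(X, noise_mask) + 1e-6 * np.eye(4))
enhanced = w.conj() @ X                     # (T,) enhanced signal
```

The distortionless constraint w^H d = 1 holds by construction, so the beamformer passes the target direction unchanged while minimizing noise power.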
Pages: 780-793 (14 pages)