Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

Cited by: 44
Authors
Shimada, Kazuki [1 ]
Bando, Yoshiaki [1 ]
Mimura, Masato [1 ]
Itoyama, Katsutoshi [1 ]
Yoshii, Kazuyoshi [1 ,2 ]
Kawahara, Tatsuya [1 ]
Affiliations
[1] Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan
[2] RIKEN, Ctr Adv Intelligence Project, Tokyo 1030027, Japan
Keywords
Noisy speech recognition; speech enhancement; multichannel nonnegative matrix factorization; beamforming; CONVOLUTIVE MIXTURES; NEURAL-NETWORKS; SEPARATION; SINGLE; MODEL;
DOI
10.1109/TASLP.2019.2907015
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, minimum variance distortionless response (MVDR) beamforming has been widely used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimate such spatial information, conventional studies take a supervised approach that classifies each time-frequency (TF) bin into noise or speech by training a deep neural network (DNN). The performance of ASR, however, is degraded in an unknown noisy environment. To solve this problem, we take an unsupervised approach that decomposes each TF bin into the sum of speech and noise by using multichannel nonnegative matrix factorization (MNMF). This enables us to accurately estimate the SCMs of speech and noise not from observed noisy mixtures but from separated speech and noise components. In this paper, we propose online MVDR beamforming by effectively initializing and incrementally updating the parameters of MNMF. Another main contribution is to comprehensively investigate the ASR performance obtained by various types of spatial filters, i.e., time-invariant and time-variant versions of MVDR beamformers and rank-1 and full-rank multichannel Wiener filters, in combination with MNMF. The experimental results showed that the proposed method outperformed the state-of-the-art DNN-based beamforming method in unknown environments that did not match the training data.
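As background for the abstract, the MVDR beamformer it refers to computes, for each frequency bin, the weight vector w = R_n^{-1} d / (d^H R_n^{-1} d), where R_n is the noise SCM and d is the speech steering vector. The sketch below is a minimal, illustrative NumPy implementation of that closed-form solution (function name and toy inputs are assumptions, not from the paper); the paper's contribution is how R_n and d are estimated via MNMF, not this formula itself.

```python
import numpy as np

def mvdr_weights(noise_scm, steering):
    """MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d) for one frequency bin."""
    rn_inv_d = np.linalg.solve(noise_scm, steering)     # R_n^{-1} d
    return rn_inv_d / (steering.conj() @ rn_inv_d)      # normalize denominator

# Toy example: 3 microphones, synthetic Hermitian positive-definite noise SCM.
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
noise_scm = a @ a.conj().T + 3 * np.eye(3)              # guarantees invertibility
steering = np.exp(1j * rng.uniform(0, 2 * np.pi, 3))    # unit-modulus phase vector

w = mvdr_weights(noise_scm, steering)
# Distortionless constraint: the target direction passes with unit gain.
print(np.allclose(w.conj() @ steering, 1.0))            # → True
```

The distortionless constraint w^H d = 1 holds by construction since d^H R_n^{-1} d is real and positive for a Hermitian positive-definite R_n.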
Pages: 960-971 (12 pages)