Improvement of Mask-Based Speech Source Separation Using DNN

Times Cited: 0
Authors
Zhan, Ge [1 ]
Huang, Zhaoqiong [1 ]
Ying, Dongwen [1 ]
Pan, Jielin [1 ]
Yan, Yonghong [1 ]
Affiliations
[1] Chinese Acad Sci, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
Source
2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2016
Funding
National Natural Science Foundation of China
Keywords
Spectrographic speech mask; speech presence probability; time-frequency correlation; neighbor factor; deep neural networks; recognition; noise
DOI
Not available
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
The speech mask is widely used to separate multiple speech sources: time-frequency (TF) bins are classified into clusters, each corresponding to one source. For each source, the separated signal consists of the components in the TF bins dominated by that source, while the components in the remaining bins are completely masked. Most separation methods ignore the masked components. In fact, the masked components may still carry useful information, so mask-based speech source separation can be improved by reconstructing them. This paper proposes a post-processing method that reconstructs the masked frequency components with a deep neural network (DNN). We build a regression from the reliable frequency components to the masked components. After mask-based separation, the reliable components are kept unchanged, and the masked components are replaced by the DNN outputs. Experimental results confirmed that the proposed method significantly improves mask-based separation and that the masked components remain useful to speech quality.
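To make the pipeline described in the abstract concrete, the sketch below illustrates the two stages under simple assumptions: binary masks assign each TF bin to the dominant source, and a toy (untrained) multilayer perceptron regresses from the reliable log-magnitudes to the masked ones, reusing the mixture phase. The frame length, hop size, network sizes, and the stft/TinyMLP/reconstruct_frame helpers are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of mask-based separation followed by DNN reconstruction of
# masked TF bins. All sizes and helper names are illustrative assumptions.
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Hann-windowed framewise FFT; returns a (frames, bins) complex array."""
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)

def binary_masks(spec_estimates):
    """Assign each TF bin to the source with the largest estimated magnitude."""
    mags = np.abs(np.stack(spec_estimates))          # (sources, frames, bins)
    dominant = mags.argmax(axis=0)
    return [(dominant == s).astype(float) for s in range(len(spec_estimates))]

class TinyMLP:
    """Toy regression DNN mapping reliable log-magnitudes to masked ones."""
    def __init__(self, dim_in, dim_hidden, dim_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((dim_in, dim_hidden)) * 0.01
        self.b1 = np.zeros(dim_hidden)
        self.W2 = rng.standard_normal((dim_hidden, dim_out)) * 0.01
        self.b2 = np.zeros(dim_out)

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)   # ReLU hidden layer
        return h @ self.W2 + self.b2                 # linear regression output

def reconstruct_frame(mix_frame, mask_row, net):
    """Keep reliable bins; fill masked bins with the DNN prediction."""
    log_mag = np.log1p(np.abs(mix_frame))
    predicted = net.forward(log_mag * mask_row)      # masked bins zeroed at input
    est_mag = np.maximum(np.expm1(predicted), 0.0)   # non-negative magnitude
    out_mag = np.where(mask_row > 0, np.abs(mix_frame), est_mag)
    return out_mag * np.exp(1j * np.angle(mix_frame))  # reuse mixture phase
```

For two sources, one would compute spec = stft(mixture), derive masks from per-source spectral estimates via binary_masks, build a TinyMLP whose input and output dimensions equal the number of frequency bins, and apply reconstruct_frame row by row before the inverse STFT.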
Pages: 5