A unified DNN approach to speaker-dependent simultaneous speech enhancement and speech separation in low SNR environments

Cited by: 19
Authors
Gao, Tian [1 ]
Du, Jun [1 ]
Dai, Li-Rong [1 ]
Lee, Chin-Hui [2 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Anhui, Peoples R China
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Funding
National Natural Science Foundation of China;
Keywords
Speaker-dependent speech processing; Speech enhancement; Speech separation; Deep neural network; Low SNR; NEURAL-NETWORKS; DEEP; ALGORITHM; NOISE;
DOI
10.1016/j.specom.2017.10.003
CLC number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
We propose a unified speech enhancement framework to jointly handle both background noise and interfering speech in a speaker-dependent scenario based on deep neural networks (DNNs). We first explore speaker-dependent speech enhancement, which significantly improves performance over speaker-independent systems. Next, we treat interfering speech as one noise type, so that a single speaker-dependent DNN can perform both speech enhancement and separation. Experimental results demonstrate that the proposed unified system achieves performance comparable to that of specialized systems trained for only noise or only speech interference. Furthermore, it yields much better results than individual enhancement or separation systems when background noise and interfering speech are mixed. The training data for the two specific tasks are also found to be complementary. Finally, an ensemble learning-based framework is employed to further improve performance in low signal-to-noise ratio (SNR) environments. A voice activity detection (VAD) DNN and an ideal ratio mask (IRM) DNN are investigated to provide prior information for integrating the two sub-modules at the frame level and the time-frequency level, respectively. The results demonstrate the effectiveness of the ensemble method in low SNR environments.
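The abstract's ideal ratio mask (IRM) is a standard time-frequency training target in DNN-based enhancement. A minimal sketch of its common definition is below; the exponent `beta` and the small stabilizing constant are illustrative assumptions, not details taken from this paper.

```python
# Hedged sketch of the ideal ratio mask (IRM) as a time-frequency target,
# assuming the common definition IRM(t, f) = (S^2 / (S^2 + N^2))^beta.
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Compute the IRM per time-frequency bin from magnitude spectrograms."""
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2
    # Epsilon avoids division by zero in silent bins.
    return (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

# Toy magnitudes: 4 frames x 3 frequency bins.
speech = np.array([[2.0, 1.0, 0.5]] * 4)
noise = np.array([[0.5, 1.0, 2.0]] * 4)
mask = ideal_ratio_mask(speech, noise)

# Applying the mask to the mixture magnitude attenuates noise-dominated bins.
mixture_mag = np.sqrt(speech ** 2 + noise ** 2)
enhanced = mask * mixture_mag
```

In practice the DNN is trained to predict this mask from noisy features; at test time the predicted mask is applied to the noisy spectrogram before waveform resynthesis.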
Pages: 28-39
Page count: 12