Multi-Task Joint-Learning for Robust Voice Activity Detection

Cited by: 0
Authors
Zhuang, Yimeng [1 ]
Tong, Sibo [1 ]
Yin, Maofan [1 ]
Qian, Yanmin [1 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong University, Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Brain Science and Technology Research Center, SpeechLab, Department of Computer Science and Engineering, Shanghai, People's Republic of China
Source
2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2016
Keywords
voice activity detection; multi-task learning; multi-frame predictions; deep neural networks;
DOI
None available
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Model-based VAD approaches have been widely used and have achieved success in practice. These approaches usually cast VAD as a frame-level classification problem and employ statistical classifiers, such as Gaussian Mixture Models (GMM) or Deep Neural Networks (DNN), to assign a speech/silence label to each frame. Due to the frame-independence assumption in classification, the VAD results tend to be fragile. To address this problem, this paper proposes a new structured multi-frame prediction DNN approach to improve segment-level VAD performance. During DNN training, the VAD labels of multiple consecutive frames are concatenated together as targets and jointly trained with a speech enhancement task to achieve robustness under noisy conditions. During testing, the VAD label for each frame is obtained by merging the prediction results from neighbouring frames. Experiments on the Aurora 4 dataset showed that conventional DNN-based VAD yields poor and unstable predictions, while the proposed multi-task trained VAD is much more robust.
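The test-time merging step described in the abstract can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the DNN emits, for each input frame t, a vector of K speech posteriors covering frames t through t+K-1, and that the per-frame label is obtained by averaging all overlapping predictions and thresholding. The function name, the averaging rule, and the threshold are assumptions for illustration only.

```python
import numpy as np

def merge_multiframe_predictions(preds: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Merge structured multi-frame VAD outputs into per-frame labels.

    preds[t, k] is the speech posterior that the network, when fed the
    input at frame t, predicts for frame t + k (k = 0 .. K-1). Each
    frame is therefore covered by up to K overlapping predictions; we
    average them and threshold to obtain a speech/silence label.
    """
    T, K = preds.shape
    acc = np.zeros(T)  # summed posteriors per frame
    cnt = np.zeros(T)  # number of predictions covering each frame
    for t in range(T):
        for k in range(K):
            if t + k < T:
                acc[t + k] += preds[t, k]
                cnt[t + k] += 1
    # cnt is always >= 1 because the k = 0 prediction covers frame t itself.
    return (acc / cnt) >= threshold

# Toy example: K = 3 frame predictions over T = 5 frames
# (high posteriors at the start, low at the end).
preds = np.array([
    [0.9, 0.8, 0.7],
    [0.8, 0.9, 0.2],
    [0.7, 0.1, 0.1],
    [0.2, 0.1, 0.0],
    [0.1, 0.0, 0.1],
])
labels = merge_multiframe_predictions(preds)
```

With this averaging rule, a single noisy frame-level prediction is outvoted by its neighbours, which is one plausible reason multi-frame prediction smooths segment boundaries compared with independent per-frame classification.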
Pages: 5