Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition

Cited by: 80
Authors
Weng, Chao [1]
Yu, Dong [1]
Seltzer, Michael L. [1]
Droppo, Jasha [1]
Affiliations
[1] Microsoft Research, Redmond, WA 98052, USA
Keywords
Deep neural network (DNN); joint decoding; multi-talker automatic speech recognition (ASR); noise robustness; single-channel; weighted finite-state transducer (WFST)
DOI
10.1109/TASLP.2015.2444659
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data; separate DNNs that estimate the senone posterior probabilities of the louder and softer speakers at each frame; a weighted finite-state transducer (WFST)-based two-talker decoder that jointly estimates the speakers and their speech; a speaker-switching penalty estimated from energy-pattern changes in the mixed speech; and a confidence-based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system is remarkably robust to interference from a competing speaker. The best configuration achieves an average word error rate (WER) of 18.8% across the SNR conditions and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute while making fewer assumptions.
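The multi-style training strategy in the abstract starts from artificially mixed two-talker data. The paper itself provides no code; the following is a minimal sketch, assuming 1-D NumPy waveforms at a common sample rate, of how a competing utterance might be mixed into a target utterance at a chosen target-to-interferer SNR. The helper name mix_at_snr and the SNR grid are illustrative, not taken from the paper.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an interfering utterance into a target utterance at a given
    target-to-interferer SNR (in dB), scaling only the interferer."""
    # Trim or zero-pad the interferer to match the target's length.
    if len(interferer) >= len(target):
        interferer = interferer[: len(target)]
    else:
        interferer = np.pad(interferer, (0, len(target) - len(interferer)))

    # Energies of the two signals (epsilon guards against a silent interferer).
    e_target = np.sum(target.astype(np.float64) ** 2)
    e_interf = np.sum(interferer.astype(np.float64) ** 2) + 1e-12

    # Gain that places the interferer at the requested SNR relative to the target:
    # SNR = 10 * log10(e_target / (gain**2 * e_interf)).
    gain = np.sqrt(e_target / (e_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer

# Usage sketch: pair each clean utterance with a random competing utterance
# at a random SNR to build a multi-condition training set.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # placeholder for a real waveform
competing = rng.standard_normal(12000)  # placeholder for a real waveform
snr_db = rng.choice([6, 3, 0, -3, -6])  # illustrative SNR grid; the challenge
                                        # task spans clean down to -9 dB
mixed = mix_at_snr(clean, competing, snr_db)
```

Sampling the SNR per utterance, rather than fixing it, is what makes the training "multi-style": the DNNs see the full range of louder/softer energy ratios they will face at test time.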
Pages: 1670-1679
Page count: 10