Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition

Cited by: 80
Authors
Weng, Chao [1]
Yu, Dong [1]
Seltzer, Michael L. [1]
Droppo, Jasha [1]
Affiliations
[1] Microsoft Research, Redmond, WA 98052, USA
Keywords
Deep neural network (DNN); joint decoding; multi-talker automatic speech recognition (ASR); noise robustness; single-channel; weighted finite-state transducer (WFST)
DOI
10.1109/TASLP.2015.2444659
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data; separate DNNs that estimate the senone posterior probabilities of the louder and softer speakers at each frame; a weighted finite-state transducer (WFST)-based two-talker decoder that jointly estimates the speakers and their speech; a speaker-switching penalty estimated from energy-pattern changes in the mixed speech; and a confidence-based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system is remarkably robust to interference from a competing speaker. The best configuration achieves an average word error rate (WER) of 18.8% across the SNR conditions and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute while making fewer assumptions.
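The multi-style training strategy in the abstract starts from artificially mixed two-talker data. The paper itself provides no code; the following is a minimal sketch, assuming 1-D NumPy waveforms at a common sample rate, of how a competing utterance might be mixed into a target utterance at a chosen target-to-interferer SNR. The helper name mix_at_snr and the SNR grid are illustrative, not taken from the paper.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix an interfering utterance into a target utterance at a given
    target-to-interferer SNR (in dB), scaling only the interferer."""
    # Trim or zero-pad the interferer to match the target's length.
    if len(interferer) >= len(target):
        interferer = interferer[: len(target)]
    else:
        interferer = np.pad(interferer, (0, len(target) - len(interferer)))

    # Energies of the two signals (epsilon guards against a silent interferer).
    e_target = np.sum(target.astype(np.float64) ** 2)
    e_interf = np.sum(interferer.astype(np.float64) ** 2) + 1e-12

    # Gain that places the interferer at the requested SNR relative to the target:
    # SNR = 10 * log10(e_target / (gain**2 * e_interf)).
    gain = np.sqrt(e_target / (e_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer

# Usage sketch: pair each clean utterance with a random competing utterance
# at a random SNR to build a multi-condition training set.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # placeholder for a real waveform
competing = rng.standard_normal(12000)  # placeholder for a real waveform
snr_db = rng.choice([6, 3, 0, -3, -6])  # illustrative SNR grid; the challenge
                                        # task spans clean down to -9 dB
mixed = mix_at_snr(clean, competing, snr_db)
```

Sampling the SNR per utterance, rather than fixing it, is what makes the training "multi-style": the DNNs see the full range of louder/softer energy ratios they will face at test time.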
Pages: 1670-1679
Page count: 10