ROBUST DISCRIMINATIVE KEYWORD SPOTTING FOR EMOTIONALLY COLORED SPONTANEOUS SPEECH USING BIDIRECTIONAL LSTM NETWORKS

Cited by: 31
Authors
Woellmer, Martin [1]
Eyben, Florian [1]
Keshet, Joseph [2]
Graves, Alex [3]
Schuller, Bjoern [1]
Rigoll, Gerhard [1]
Affiliations
[1] Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany
[2] Idiap Res Inst, Martigny, Switzerland
[3] Tech Univ Munich, Inst Comp Sci 6, D-80290 Munich, Germany
Source
2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS | 2009
Keywords
Speech recognition; Robustness; Recurrent neural networks
DOI
10.1109/ICASSP.2009.4960492
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, it is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.
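The idea of enriching speech features with the outputs of a bidirectional recurrent network can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the record above: a plain tanh recurrent layer stands in for the LSTM cells, the weights are random and untrained, the features are simply concatenated with framewise phoneme posteriors (the abstract does not specify how the outputs are incorporated), and all function names and dimensions are hypothetical.

```python
import math
import random

def matvec(vec, mat):
    """Multiply a row vector (length R) by a matrix stored as R rows of C columns."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]

def run_direction(frames, w_in, w_rec, reverse=False):
    """Run a plain tanh recurrent layer over the frames, forward or backward in time."""
    order = range(len(frames) - 1, -1, -1) if reverse else range(len(frames))
    h = [0.0] * len(w_rec)
    out = [None] * len(frames)
    for t in order:
        pre = [a + b for a, b in zip(matvec(frames[t], w_in), matvec(h, w_rec))]
        h = [math.tanh(p) for p in pre]
        out[t] = h
    return out

def softmax(logits):
    """Numerically stable softmax over one frame's output activations."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def phoneme_posteriors(frames, params):
    """Combine forward (past) and backward (future) states into per-frame posteriors."""
    fwd = run_direction(frames, params["w_in_f"], params["w_rec_f"])
    bwd = run_direction(frames, params["w_in_b"], params["w_rec_b"], reverse=True)
    return [softmax(matvec(f + b, params["w_out"])) for f, b in zip(fwd, bwd)]

def augment_features(frames, params):
    """Append the posteriors to each acoustic feature vector (assumed combination)."""
    posts = phoneme_posteriors(frames, params)
    return [list(x) + p for x, p in zip(frames, posts)]

def random_params(n_feat, n_hidden, n_phones, seed=0):
    """Untrained random weights, for illustration only."""
    rng = random.Random(seed)
    def w(rows, cols, scale):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_in_f": w(n_feat, n_hidden, 0.1), "w_rec_f": w(n_hidden, n_hidden, 0.3),
        "w_in_b": w(n_feat, n_hidden, 0.1), "w_rec_b": w(n_hidden, n_hidden, 0.3),
        "w_out": w(2 * n_hidden, n_phones, 0.5),
    }
```

Calling `augment_features(frames, random_params(13, 6, 4))` on a list of 13-dimensional frames would return 17-dimensional frames. Because the backward pass reads the sequence from the end, the posteriors at a given frame depend on later frames as well as earlier ones, which is the property the abstract refers to as using past and future context.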
Pages: 3949 / +
Page count: 2