Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening

Cited by: 107
Authors
Woellmer, Martin [1]
Schuller, Bjoern [1]
Eyben, Florian [1]
Rigoll, Gerhard [1]
Affiliations
[1] Tech Univ Munich, Inst Human Machine Commun, D-80333 Munich, Germany
Keywords
Dynamic Bayesian networks (DBNs); emotion recognition; intelligent environments; long short-term memory (LSTM); recurrent neural nets; virtual agents; RECOGNITION; ANNOTATION; SPEECH; TIME
DOI
10.1109/JSTSP.2010.2057200
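Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract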
The automatic estimation of human affect from the speech signal is an important step towards making virtual agents more natural and human-like. In this paper, we present a novel technique for incremental recognition of the user's emotional state as it is applied in a sensitive artificial listener (SAL) system designed for socially competent human-machine communication. Our method is capable of using acoustic, linguistic, as well as long-range contextual information in order to continuously predict the current quadrant in a two-dimensional emotional space spanned by the dimensions valence and activation. The main system components are a hierarchical dynamic Bayesian network (DBN) for detecting linguistic keyword features and long short-term memory (LSTM) recurrent neural networks which model phoneme context and emotional history to predict the affective state of the user. Experimental evaluations on the SAL corpus of non-prototypical real-life emotional speech data consider a number of variants of our recognition framework: continuous emotion estimation from low-level feature frames is evaluated as a new alternative to the common approach of computing statistical functionals of given speech turns. Further performance gains are achieved by discriminatively training LSTM networks and by using bidirectional context information, leading to a quadrant prediction F1-measure of up to 51.3 %, which is only 7.6 % below the average inter-labeler consistency.
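The two quantities the abstract reports on, the quadrant of the valence-activation space and the F1-measure of the quadrant predictions, can be illustrated with a minimal sketch. This is not the authors' implementation: the quadrant numbering, the zero threshold on both axes, and the function names `quadrant` and `macro_f1` are assumptions made for illustration, and the abstract does not state how the F1-measure is averaged, so a plain macro-average over classes is used here.

```python
from collections import Counter


def quadrant(valence: float, activation: float) -> int:
    """Map a (valence, activation) point to a quadrant label 1..4.

    Assumed convention (not taken from the paper):
      1 = positive valence, positive activation
      2 = negative valence, positive activation
      3 = negative valence, negative activation
      4 = positive valence, negative activation
    A value of exactly zero is treated as positive.
    """
    if activation >= 0:
        return 1 if valence >= 0 else 2
    return 4 if valence >= 0 else 3


def macro_f1(true_labels, predicted_labels, classes=(1, 2, 3, 4)) -> float:
    """Unweighted mean of per-class F1 scores (one common definition)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical frame-level (valence, activation) annotations and predictions.
    truth = [quadrant(v, a) for v, a in [(0.4, 0.6), (-0.3, 0.2), (-0.5, -0.1), (0.2, -0.7)]]
    preds = [quadrant(v, a) for v, a in [(0.1, 0.3), (0.2, 0.4), (-0.2, -0.3), (0.5, -0.2)]]
    print(truth, preds, macro_f1(truth, preds))
```

Because the score is macro-averaged, every quadrant contributes equally regardless of how often it occurs, which matters when some quadrants are rare in the annotated data.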
Pages: 867-881
Number of pages: 15