Learning Transferable Features for Speech Emotion Recognition

Cited by: 20
Authors
Marczewski, Alison [1 ,2 ]
Veloso, Adriano [1 ]
Ziviani, Nivio [1 ,2 ]
Affiliations
[1] Univ Fed Minas Gerais, CS Dept, Belo Horizonte, MG, Brazil
[2] Kunumi, Belo Horizonte, MG, Brazil
Source
PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17) | 2017
Keywords
Affective Computing; Emotion Recognition; Deep Learning; Autoencoder; Voice; Classification; Expression; Corpus; States
DOI
10.1145/3126686.3126735
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Emotion recognition from speech is one of the key steps towards emotional intelligence in advanced human-machine interaction. Identifying emotions in human speech requires learning features that are robust and discriminative across diverse domains that differ in language, spontaneity of speech, recording conditions, and types of emotions. This corresponds to a learning scenario in which the joint distributions of features and labels may change substantially across domains. In this paper, we propose a deep architecture that jointly exploits a convolutional network for extracting domain-shared features and a long short-term memory network for classifying emotions using domain-specific features. We use transferable features to enable model adaptation from multiple source domains, given the sparseness of speech emotion data and the fact that target domains have little labeled data. A comprehensive cross-corpora experiment with diverse speech emotion domains reveals that transferable features provide gains ranging from 4.3% to 18.4% in speech emotion recognition. We evaluate several domain adaptation approaches, and we perform an ablation study to understand which source domains contribute most to the overall recognition effectiveness for a given target domain.
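As a rough illustration of the kind of architecture the abstract describes, the sketch below stacks a small convolutional feature extractor in front of an LSTM classifier in Keras. The input shape, layer sizes, and four-class output are assumptions for illustration, not the authors' configuration.

# Minimal sketch, not the authors' exact model: a convolutional front-end that could
# act as a domain-shared feature extractor, followed by an LSTM over the resulting
# frame-level features. Shapes, layer sizes, and the 4-class output are assumptions.
from tensorflow.keras import layers, models

N_FRAMES, N_MELS, N_CLASSES = 300, 40, 4       # assumed log-mel spectrogram size / label set

inputs = layers.Input(shape=(N_FRAMES, N_MELS, 1))
x = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)   # halve both the time and frequency axes
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(1, 2))(x)   # keep time resolution for the LSTM
# Collapse the frequency axis so every remaining time step is one feature vector.
x = layers.Reshape((N_FRAMES // 2, (N_MELS // 4) * 64))(x)
x = layers.LSTM(128)(x)                        # temporal, domain-specific classifier head
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In a transfer setting along the lines of the abstract, the convolutional layers would be pre-trained on source corpora and reused (frozen or fine-tuned) on the target corpus, while the recurrent head is re-trained on the target's labeled data; the paper's actual adaptation procedure is not reproduced here.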
Pages: 529-536
Page count: 8