Emotion Recognition in Speech with Latent Discriminative Representations Learning

Cited: 13
Authors
Han, Jing [1]
Zhang, Zixing [2]
Keren, Gil [1]
Schuller, Björn [1,2]
Affiliations
[1] ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
[2] Group on Language, Audio & Music, Imperial College London, London, England
Funding
European Union's Horizon 2020
DOI
10.3813/AAA.919214
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Despite significant recent advances in the field of affective computing, learning meaningful representations for emotion recognition remains challenging. In this paper, we propose a novel feature learning approach named Latent Discriminative Representation (LDR) learning for speech emotion recognition. Unlike most existing hand-crafted features designed for specific applications, or features learnt by a standard neural network, the proposed method incorporates an additional training objective in order to learn better representations for the task of interest. To this end, we group the training samples into triplets such that the second member of each triplet comes from the same class as the first, while the third member comes from a different class. During training, we maximise the distance between samples from different classes in the latent representation space, and minimise the distance between samples from the same class. To evaluate the effectiveness of LDR, we perform extensive experiments on the widely used IEMOCAP database, and find that LDR improves performance over the standard neural network training procedure. © 2018 The Author(s). Published by S. Hirzel Verlag · EAA.
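The triplet objective described in the abstract maps naturally onto a standard triplet margin loss added to the usual classification loss. Below is a minimal PyTorch sketch of that idea; it is not the authors' exact formulation, and the network sizes, margin, distance metric, and weighting factor `alpha` are illustrative assumptions (the 88-dimensional input merely echoes eGeMAPS-style feature vectors, cf. reference [4]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDRNet(nn.Module):
    """Toy encoder + classifier; every layer size here is an assumption,
    not taken from the paper."""
    def __init__(self, in_dim=88, latent_dim=64, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)               # latent representation
        return z, self.classifier(z)      # latent + class logits

def ldr_loss(model, anchor, positive, negative, labels, margin=1.0, alpha=0.5):
    """Cross-entropy on the anchor plus a triplet term on the latents:
    pull same-class latents together, push different-class latents apart."""
    z_a, logits = model(anchor)
    z_p, _ = model(positive)     # same class as anchor
    z_n, _ = model(negative)     # different class from anchor
    ce = F.cross_entropy(logits, labels)
    triplet = F.triplet_margin_loss(z_a, z_p, z_n, margin=margin)
    return ce + alpha * triplet

# Usage with random stand-in data (batch of 8 triplets, 4 emotion classes):
x_a, x_p, x_n = (torch.randn(8, 88) for _ in range(3))
y = torch.randint(0, 4, (8,))
model = LDRNet()
loss = ldr_loss(model, x_a, x_p, x_n, y)
loss.backward()
```

In practice, how the triplets are sampled (e.g. mining hard negatives rather than random different-class samples) typically matters as much as the loss term itself.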
Pages: 737–740
Page count: 4
References
13 entries in total
[1] Boes, Michiel; Filipan, Karlo; De Coensel, Bert; Botteldooren, Dick. Machine Listening for Park Soundscape Quality Assessment. Acta Acustica united with Acustica, 2018, 104(1): 121-130.
[2] Bredin, Hervé. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017): 5430. DOI: 10.1109/ICASSP.2017.7953194.
[3] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[4] Eyben, Florian; Scherer, Klaus R.; Schuller, Björn W.; Sundberg, Johan; André, Elisabeth; Busso, Carlos; Devillers, Laurence Y.; Epps, Julien; Laukka, Petri; Narayanan, Shrikanth S.; Truong, Khiet P. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing, 2016, 7(2): 190-202.
[5] Han, Jing; Zhang, Zixing; Cummins, Nicholas; Ringeval, Fabien; Schuller, Björn. Strength Modelling for Real-World Automatic Continuous Affect Recognition from Audiovisual Signals. Image and Vision Computing, 2017, 65: 76-86.
[6] Hoffer, Elad; Ailon, Nir. Deep Metric Learning Using Triplet Network. Similarity-Based Pattern Recognition (SIMBAD 2015), LNCS 9370, 2015: 84-92.
[7] Huang, Po-Sen. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13), 2013: 2333.
[8] Kim, Y. Proc. International Conference on Affective Computing and Intelligent Interaction (ACII 2015): 553. DOI: 10.1109/ACII.2015.7344624.
[9] Le Lan, Gaël; Charlet, Delphine; Larcher, Anthony; Meignier, Sylvain. A Triplet Ranking-Based Neural Network for Speaker Diarization and Linking. Proc. 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017): 3572-3576.
[10] Maskeliūnas, Rytis; Raudonis, Vidas; Damaševičius, Robertas. Recognition of Emotional Vocalizations of Canine. Acta Acustica united with Acustica, 2018, 104(2): 304-314.