Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

Cited by: 52
Authors
Huang, Jian [1 ,2 ]
Li, Ya [1 ]
Tao, Jianhua [1 ,2 ,3 ]
Lian, Zheng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Source
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018
Funding
National Natural Science Foundation of China
Keywords
speech emotion recognition; triplet loss; variable-length inputs;
DOI
10.21437/Interspeech.2018-1432
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Automatic emotion recognition is a crucial element in understanding human behavior and interaction. Prior work on speech emotion recognition has focused on exploring various feature sets and models. In contrast to these methods, we propose a triplet framework based on a Long Short-Term Memory (LSTM) neural network for speech emotion recognition. The system learns a mapping from acoustic features to discriminative embedding features, which serve as the basis for classification with an SVM at test time. The proposed model is trained with a triplet loss and a supervised loss simultaneously. The triplet loss shortens intra-class distances and lengthens inter-class distances, while the supervised loss incorporates class label information. To handle variable-length inputs, we explore three different strategies, which also make better use of temporal dynamics. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed methods improve performance. We demonstrate the promise of the triplet framework for speech emotion recognition and present our analysis.
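The abstract describes a joint objective: an LSTM maps variable-length acoustic feature sequences to fixed-size embeddings, trained with a triplet loss plus a supervised loss, and an SVM is applied to the embeddings at test time. Below is a minimal PyTorch sketch of that kind of joint objective; it is not the authors' code, and the layer sizes (88-dimensional frame features, 64-dimensional embeddings), the triplet margin, and the 0.5 loss weight are illustrative assumptions.

# Minimal sketch (not the authors' implementation): an LSTM embeds variable-length
# acoustic feature sequences, trained jointly with a triplet loss (pulls same-emotion
# embeddings together, pushes different-emotion embeddings apart) and a supervised
# cross-entropy loss that injects class label information.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

class TripletEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=88, hidden=128, embed_dim=64, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Linear(hidden, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x, lengths):
        # x: (batch, max_len, feat_dim); lengths: true sequence lengths before padding.
        packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)           # final hidden state summarizes the utterance
        emb = torch.tanh(self.embed(h_n[-1]))     # fixed-size utterance embedding
        return emb, self.classifier(emb)

def pad_batch(seqs):
    # Pad a list of (len_i, feat_dim) tensors to a common length; keep the true lengths.
    lengths = torch.tensor([s.shape[0] for s in seqs])
    return pad_sequence(seqs, batch_first=True), lengths

model = TripletEmotionLSTM()
triplet_loss = nn.TripletMarginLoss(margin=1.0)   # shorter intra-class, longer inter-class distances
ce_loss = nn.CrossEntropyLoss()

# Toy batch: anchor and positive share an emotion label, negative differs.
anchor, a_len = pad_batch([torch.randn(120, 88), torch.randn(95, 88)])
positive, p_len = pad_batch([torch.randn(110, 88), torch.randn(80, 88)])
negative, n_len = pad_batch([torch.randn(130, 88), torch.randn(70, 88)])
labels = torch.tensor([0, 2])                     # anchor emotion labels

emb_a, logits_a = model(anchor, a_len)
emb_p, _ = model(positive, p_len)
emb_n, _ = model(negative, n_len)

# Joint objective: triplet term shapes the embedding space, cross-entropy term adds supervision.
loss = triplet_loss(emb_a, emb_p, emb_n) + 0.5 * ce_loss(logits_a, labels)
loss.backward()

At test time, the learned embeddings would be fed to a separate classifier such as an SVM, as the abstract states; the 0.5 weighting between the two losses above is purely a placeholder.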
Pages: 3673-3677
Page count: 5