Speech Emotion Recognition from Variable-Length Inputs with Triplet Loss Function

Cited by: 50
Authors
Huang, Jian [1 ,2 ]
Li, Ya [1 ]
Tao, Jianhua [1 ,2 ,3 ]
Lian, Zheng [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Source
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018
Funding
National Natural Science Foundation of China;
Keywords
speech emotion recognition; triplet loss; variable-length inputs;
DOI
10.21437/Interspeech.2018-1432
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic emotion recognition is a crucial element in understanding human behavior and interaction. Prior work on speech emotion recognition has focused on exploring various feature sets and models. In contrast to these methods, we propose a triplet framework based on the Long Short-Term Memory (LSTM) neural network for speech emotion recognition. The system learns a mapping from acoustic features to discriminative embedding features, which serve as the basis for classification with an SVM at test time. The proposed model is trained with a triplet loss and a supervised loss simultaneously: the triplet loss shortens intra-class distances and lengthens inter-class distances, while the supervised loss incorporates class label information. To handle variable-length inputs, we explore three different strategies, which also make better use of temporal dynamics. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database show that the proposed methods improve performance. We demonstrate the promise of the triplet framework for speech emotion recognition and present our analysis.
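The core idea described in the abstract, jointly training an LSTM utterance embedding with a triplet loss and a supervised classification loss over variable-length inputs, can be illustrated with a short sketch. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the feature dimension, hidden size, embedding size, margin, number of emotion classes, and the sequence-packing strategy for variable-length utterances are all illustrative choices.

```python
# Minimal sketch (not the authors' code): an LSTM encoder trained jointly with
# a triplet loss (pulls same-emotion embeddings together, pushes different-emotion
# embeddings apart) and a supervised cross-entropy loss. Variable-length
# utterances are handled here by sequence packing; all sizes are assumptions.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class EmotionEmbedder(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, emb_dim=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)    # embedding fed to an SVM at test time
        self.clf = nn.Linear(emb_dim, n_classes)  # head used only for the supervised loss

    def forward(self, x, lengths):
        # x: (batch, max_len, feat_dim); lengths: true frame count per utterance
        packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)            # h_n: (1, batch, hidden)
        emb = torch.tanh(self.proj(h_n[-1]))       # fixed-length utterance embedding
        return emb, self.clf(emb)


triplet_loss = nn.TripletMarginLoss(margin=1.0)
ce_loss = nn.CrossEntropyLoss()
model = EmotionEmbedder()

# Toy batch: anchor and positive share an emotion label, negative differs.
def rand_batch(n=8, max_len=300, feat_dim=40):
    lengths = torch.randint(50, max_len, (n,))
    return torch.randn(n, max_len, feat_dim), lengths

(xa, la), (xp, lp), (xn, ln) = rand_batch(), rand_batch(), rand_batch()
ya = torch.randint(0, 4, (8,))                     # anchor emotion labels

ea, logits = model(xa, la)
ep, _ = model(xp, lp)
en, _ = model(xn, ln)

# Joint objective: the triplet term shapes the embedding space,
# the cross-entropy term injects class label information.
loss = triplet_loss(ea, ep, en) + ce_loss(logits, ya)
loss.backward()
```

At test time the paper classifies the learned embeddings with an SVM rather than the auxiliary softmax head, so the `clf` layer above stands in only for the supervised training loss.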
Pages: 3673 - 3677
Number of pages: 5
Related Papers
22 in total
  • [1] [Anonymous], 2013, Proceedings of the 21st ACM International Conference on Multimedia, DOI 10.1145/2502081.2502224
  • [2] [Anonymous], 2012, Computer Science
  • [3] [Anonymous], arXiv:1801.01237
  • [4] Aytar Y., 2016, Advances in Neural Information Processing Systems (NIPS)
  • [5] Bengio Y., 2009, ICML, P41, DOI 10.1145/1553374.1553380
  • [6] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. LANGUAGE RESOURCES AND EVALUATION, 2008, 42(04): 335-359
  • [7] Chao LL, 2016, INT CONF ACOUST SPEE, P2752, DOI 10.1109/ICASSP.2016.7472178
  • [8] Eyben, Florian; Scherer, Klaus R.; Schuller, Bjoern W.; Sundberg, Johan; Andre, Elisabeth; Busso, Carlos; Devillers, Laurence Y.; Epps, Julien; Laukka, Petri; Narayanan, Shrikanth S.; Truong, Khiet P. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2016, 7(02): 190-202
  • [9] Fayek, Haytham M.; Lech, Margaret; Cavedon, Lawrence. Evaluating deep learning architectures for Speech Emotion Recognition. NEURAL NETWORKS, 2017, 92: 60-68
  • [10] Gunes, Hatice; Schuller, Bjoern. Categorical and dimensional affect analysis in continuous input: Current trends and future directions. IMAGE AND VISION COMPUTING, 2013, 31(02): 120-136