Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Cited by: 11
Authors
Noh, Kyoung Ju [1 ]
Jeong, Chi Yoon [1 ]
Lim, Jiyoun [1 ]
Chung, Seungeun [1 ]
Kim, Gague [1 ]
Lim, Jeong Mook [1 ]
Jeong, Hyuntae [1 ]
Affiliations
[1] Electronics and Telecommunications Research Institute (ETRI), Artificial Intelligence Research Laboratory, Daejeon 34129, South Korea
Keywords
speech emotion recognition; domain adaptation; SER generalization; Korean Emotional Speech Database; ensemble model; multi-path; group-loss; BLSTM network; data augmentation; neural networks; features
DOI
10.3390/s21051579
Chinese Library Classification (CLC)
O65 [Analytical Chemistry]
Discipline Codes
070302; 081704
Abstract
Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To deploy SER models in real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of SER models to unseen target domains. This study proposes a multi-path and group-loss-based network (MPGLN) for SER that supports multi-domain adaptation. The proposed model comprises a bidirectional long short-term memory (BLSTM)-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it is trained simultaneously with multiple losses that reflect the association between emotion labels in the discrete and dimensional models. To evaluate MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), comprising KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. In the evaluations of multi-domain adaptation and domain generalization, MPGLN SER improved the F1 score by 3.7% and 3.5%, respectively, over a baseline SER model that uses a temporal feature generator. We show that MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
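The abstract describes the architecture only at a high level. The following PyTorch sketch illustrates one plausible reading of it: a BLSTM path over frame-level features, a second path fed by pre-computed 128-dimensional VGGish embeddings (128 is the published VGGish output size), and two output heads whose losses are combined. The class name MPGLNSketch, the mean-pooling step, the layer sizes, and the loss weights alpha and beta are all assumptions made for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a multi-path, multi-loss SER network in PyTorch.
# Only the overall structure (BLSTM path + transferred VGGish path + joint
# discrete/dimensional losses) follows the abstract; all sizes, pooling
# choices, and loss weights below are assumptions. VGGish embeddings are
# assumed to be precomputed (128-d per utterance).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPGLNSketch(nn.Module):
    def __init__(self, n_mels=40, n_classes=4, vggish_dim=128, hidden=128):
        super().__init__()
        # Path 1: BLSTM-based temporal feature generator over frame features.
        self.blstm = nn.LSTM(n_mels, hidden, batch_first=True,
                             bidirectional=True)
        # Path 2: projection of the transferred VGGish embedding.
        self.vggish_proj = nn.Linear(vggish_dim, hidden)
        fused = 2 * hidden + hidden
        # Two heads trained jointly: discrete emotion classes and
        # dimensional (valence/arousal) regression.
        self.cls_head = nn.Linear(fused, n_classes)
        self.dim_head = nn.Linear(fused, 2)

    def forward(self, frames, vggish_emb):
        # frames: (B, T, n_mels); vggish_emb: (B, vggish_dim)
        seq, _ = self.blstm(frames)          # (B, T, 2*hidden)
        temporal = seq.mean(dim=1)           # mean-pool over time (assumed)
        transferred = F.relu(self.vggish_proj(vggish_emb))
        fused = torch.cat([temporal, transferred], dim=-1)
        return self.cls_head(fused), self.dim_head(fused)

def group_loss(cls_logits, dim_pred, y_cls, y_dim, alpha=1.0, beta=0.5):
    # Weighted sum of the discrete and dimensional losses; this weighting
    # scheme is an assumption, not the paper's exact group-loss formulation.
    return (alpha * F.cross_entropy(cls_logits, y_cls)
            + beta * F.mse_loss(dim_pred, y_dim))
```

Under this reading, multi-domain adaptation amounts to optimizing the combined loss on batches pooled from the multiple source-domain datasets (KESDy18, KESDy19, IEMOCAP); the paper's exact group-loss formulation and label association should be taken from the full text.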
Pages: 1-18 (18 pages)