Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN

Cited: 76
Authors
Yao, Zengwei [1]
Wang, Zihao [1]
Liu, Weihuang [1]
Liu, Yaqian [1]
Pan, Jiahui [1]
Affiliations
[1] South China Normal University, School of Software, Guangzhou 510641, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Attention mechanism; Multi-task learning; Classifier fusion;
DOI
10.1016/j.specom.2020.03.005
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Speech emotion recognition plays an increasingly important role in affective computing and remains a challenging task due to its complexity. In this study, we developed a framework integrating three distinctive classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (i.e., angry, happy, neutral and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) applied to the LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was used to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two ways of describing emotion (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models to acquire generalized features by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. Three emotion recognition experiments on the IEMOCAP corpus were conducted. The results show that the attention-based weighted-pooling method endowed the neural networks with the ability to focus on emotionally salient parts of an utterance, and that the generalized features learned through multi-task learning helped the networks achieve higher accuracies in emotion classification. Furthermore, the proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, significantly higher than those of each individual classifier, validating the effectiveness of the proposed classifier-fusion approach.
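For a concrete picture of two of the mechanisms summarized above, the following minimal PyTorch sketch illustrates attention-based weighted pooling over frame-level features and a multi-task head that jointly performs category classification and attribute regression. This is not the authors' code; the GRU choice, layer sizes, attribute dimension, and loss weighting are illustrative assumptions.

# Minimal sketch (assumed configuration, not the published implementation):
# attention-weighted pooling over RNN frame outputs plus multi-task heads
# for discrete emotion classes and continuous emotion attributes.
import torch
import torch.nn as nn

class AttentiveRNNEmotionModel(nn.Module):
    def __init__(self, n_lld: int = 32, hidden: int = 128,
                 n_classes: int = 4, n_attributes: int = 3):
        super().__init__()
        # Bidirectional GRU over frame-level LLD sequences.
        self.rnn = nn.GRU(n_lld, hidden, batch_first=True, bidirectional=True)
        # Attention scorer: one scalar score per frame.
        self.attn = nn.Linear(2 * hidden, 1)
        # Multi-task heads: classification of discrete categories and
        # regression of continuous attributes (e.g., valence/arousal/dominance).
        self.cls_head = nn.Linear(2 * hidden, n_classes)
        self.reg_head = nn.Linear(2 * hidden, n_attributes)

    def forward(self, x):
        # x: (batch, time, n_lld)
        h, _ = self.rnn(x)                        # (batch, time, 2*hidden)
        scores = self.attn(h).squeeze(-1)         # (batch, time)
        weights = torch.softmax(scores, dim=1)    # attention over frames
        utt = (weights.unsqueeze(-1) * h).sum(1)  # weighted-pooled utterance vector
        return self.cls_head(utt), self.reg_head(utt)

if __name__ == "__main__":
    model = AttentiveRNNEmotionModel()
    lld = torch.randn(8, 300, 32)                 # 8 utterances, 300 frames each
    logits, attrs = model(lld)
    labels = torch.randint(0, 4, (8,))
    targets = torch.randn(8, 3)
    # Joint multi-task loss: cross-entropy for categories plus MSE for
    # attributes (the 0.5 weighting is an assumption).
    loss = nn.functional.cross_entropy(logits, labels) + \
           0.5 * nn.functional.mse_loss(attrs, targets)
    loss.backward()

In the same spirit, the confidence-based fusion described in the abstract can be thought of as weighting each classifier's posterior by a per-class confidence estimate before combining; the exact confidence measure used by the authors is given in the paper itself.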
Pages: 11-19
Number of pages: 9