Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features

Cited by: 45
Authors
Hao M. [1 ,2 ]
Cao W.-H. [1 ,2 ]
Liu Z.-T. [1 ,2 ]
Wu M. [1 ,2 ]
Xiao P. [1 ,2 ]
Affiliations
[1] School of Automation, China University of Geosciences, Wuhan
[2] Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan
Source
Neurocomputing, Vol. 391 (2020), Elsevier B.V., Netherlands | Corresponding author: Cao, Wei-Hua (weihuacao@cug.edu.cn)
Funding
National Natural Science Foundation of China;
Keywords
Ensemble learning; Multi-task learning; Multiple features; Visual-audio emotion recognition;
DOI
10.1016/j.neucom.2020.01.048
Abstract
An ensemble visual-audio emotion recognition framework based on multi-task and blending learning with multiple features is proposed in this paper. To address the problem that no single feature set can accurately identify all emotions, we extract two kinds of features for each modality, i.e., Interspeech 2010 and deep features for audio data, and LBP and deep features for visual data, so that different emotions can be identified accurately by using different features. Owing to the diversity of these features, SVM classifiers are designed for the hand-crafted features, i.e., the Interspeech 2010 and local LBP features, and CNNs for the deep features, yielding four sub-models. Finally, a blending ensemble algorithm fuses the sub-models to improve the performance of visual-audio emotion recognition. In addition, multi-task learning is applied in the CNN model for deep features; it predicts multiple tasks at the same time with fewer parameters and improves the sensitivity of a single recognition model to the user's emotion by sharing information between tasks. Experiments are performed on the eNTERFACE database, and the results indicate that the accuracy of the multi-task CNN increases by 3% and 2% on average over the single-task CNN model in speaker-independent and speaker-dependent experiments, respectively. The visual-audio emotion recognition accuracy of our method reaches 81.36% and 78.42% in speaker-independent and speaker-dependent experiments, respectively, outperforming several state-of-the-art works. © 2020
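The blending fusion step described in the abstract can be sketched compactly. The following is a minimal, hypothetical illustration, not the authors' implementation: random placeholder matrices stand in for the extracted Interspeech 2010/LBP/deep features, plain SVM sub-models stand in for all four sub-models (including the multi-task CNNs), and a logistic-regression meta-learner stands in for the blending fuser. Sub-models are trained on one split, and the meta-learner is fitted on their predicted class probabilities for a held-out blending split.

# Minimal sketch of blending-based fusion over per-feature sub-models.
# All data, dimensions, and model choices below are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes = 600, 6                   # six emotion classes, as in eNTERFACE
X_audio = rng.normal(size=(n_samples, 1582))    # placeholder for Interspeech 2010 features
X_visual = rng.normal(size=(n_samples, 256))    # placeholder for an LBP histogram feature
y = rng.integers(0, n_classes, size=n_samples)  # placeholder emotion labels

# Hold out a blending split: sub-models are fit on the training split, and the
# meta-learner is fit on their probability outputs for the held-out split.
idx = np.arange(n_samples)
train_idx, blend_idx = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)

sub_models, blend_probs = [], []
for X in (X_audio, X_visual):
    clf = SVC(kernel="rbf", probability=True, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    sub_models.append(clf)
    blend_probs.append(clf.predict_proba(X[blend_idx]))

# Meta-learner fuses the concatenated class probabilities (the "blending" step).
meta = LogisticRegression(max_iter=1000)
meta.fit(np.hstack(blend_probs), y[blend_idx])

def predict(x_audio, x_visual):
    """Fuse per-modality sub-model probabilities through the trained meta-learner."""
    probs = [m.predict_proba(x) for m, x in zip(sub_models, (x_audio, x_visual))]
    return meta.predict(np.hstack(probs))

print(predict(X_audio[:5], X_visual[:5]))

In the paper itself, four sub-models (SVM on Interspeech 2010 features, SVM on local LBP features, and multi-task CNNs on audio and visual deep features) would take the place of the two placeholder SVMs above; the blending procedure is otherwise the same.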
Pages: 42-51
Number of pages: 9
References (49 in total)
  • [1] Salovey P., Mayer J.D., Emotional intelligence, Imagin. Cognit. Personal., 9, 3, pp. 185-211, (1990)
  • [2] Zeng Z., Pantic M., Roisman G.I., Et al., A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Trans. Pattern Anal. Mach. Intell., 31, 1, pp. 39-58, (2009)
  • [3] Liu Z.T., Wu M., Cao W.H., Et al., Speech emotion recognition based on feature selection and extreme learning machine decision tree, Neurocomputing, 273, pp. 271-280, (2018)
  • [4] Ricciardi L., Viscocomandini F., Erro R., Et al., Facial emotion recognition and expression in Parkinson's disease: an emotional mirror mechanism, PLoS ONE, 12, 1, (2017)
  • [5] Liu Z.T., Xie Q., Wu M., Et al., Speech emotion recognition based on an improved brain emotion learning model, Neurocomputing, 309, pp. 145-156, (2018)
  • [6] Noroozi F., Sapinski T., Kaminska D., Et al., Vocal-based emotion recognition using random forests and decision tree, Int. J. Speech Technol., 20, 2, pp. 239-246, (2017)
  • [7] Noroozi F., Marjanovic M., Njegus A., Et al., Audio-visual emotion recognition in video clips, IEEE Trans. Affect. Comput., 10, 1, pp. 60-75, (2019)
  • [8] Zhang S.Q., Li L.M., Zhao Z.J., Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech, (2012)
  • [9] Wang Y., Guan L., Venetsanopoulos A.N., Kernel cross-modal factor analysis for information fusion with application to bimodal emotion recognition, IEEE Trans. Multimedia, 14, 3, pp. 597-607, (2012)
  • [10] Zhang S.Q., Zhang S.L., Huang T.J., Et al., Learning affective features with a hybrid deep model for audio-visual emotion recognition, IEEE Trans. Circuits Syst. Video Technol., 28, 10, pp. 3030-3043, (2018)