Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features

Cited by: 45
Authors
Hao M. [1 ,2 ]
Cao W.-H. [1 ,2 ]
Liu Z.-T. [1 ,2 ]
Wu M. [1 ,2 ]
Xiao P. [1 ,2 ]
Affiliations
[1] School of Automation, China University of Geosciences, Wuhan
[2] Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan
Source
Neurocomputing, Vol. 391 (2020), Elsevier B.V., Netherlands | Corresponding author: Cao, Wei-Hua (weihuacao@cug.edu.cn)
Funding
National Natural Science Foundation of China;
Keywords
Ensemble learning; Multi-task learning; Multiple features; Visual-audio emotion recognition;
DOI
10.1016/j.neucom.2020.01.048
Abstract
An ensemble visual-audio emotion recognition framework based on multi-task and blending learning with multiple features is proposed in this paper. To address the problem that no single feature set can accurately identify all emotions, we extract two kinds of features for each modality, i.e., Interspeech 2010 and deep features for audio data, and LBP and deep features for visual data, so that different emotions can be identified accurately by using different features. Owing to the diversity of these features, SVM classifiers are designed for the hand-crafted features, i.e., the Interspeech 2010 and local LBP features, and CNNs for the deep features, yielding four sub-models. Finally, a blending ensemble algorithm fuses the sub-models to improve the performance of visual-audio emotion recognition. In addition, multi-task learning is applied in the CNN model for deep features; it predicts multiple tasks at the same time with fewer parameters and improves the sensitivity of a single recognition model to the user's emotion by sharing information between tasks. Experiments are performed on the eNTERFACE database, and the results indicate that the accuracy of the multi-task CNN increases by 3% and 2% on average over the single-task CNN model in speaker-independent and speaker-dependent experiments, respectively. The visual-audio emotion recognition accuracy of our method reaches 81.36% and 78.42% in speaker-independent and speaker-dependent experiments, respectively, outperforming several state-of-the-art works. © 2020
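The blending fusion step described in the abstract can be sketched compactly. The following is a minimal, hypothetical illustration, not the authors' implementation: random placeholder matrices stand in for the extracted Interspeech 2010/LBP/deep features, plain SVM sub-models stand in for all four sub-models (including the multi-task CNNs), and a logistic-regression meta-learner stands in for the blending fuser. Sub-models are trained on one split, and the meta-learner is fitted on their predicted class probabilities for a held-out blending split.

# Minimal sketch of blending-based fusion over per-feature sub-models.
# All data, dimensions, and model choices below are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes = 600, 6                   # six emotion classes, as in eNTERFACE
X_audio = rng.normal(size=(n_samples, 1582))    # placeholder for Interspeech 2010 features
X_visual = rng.normal(size=(n_samples, 256))    # placeholder for an LBP histogram feature
y = rng.integers(0, n_classes, size=n_samples)  # placeholder emotion labels

# Hold out a blending split: sub-models are fit on the training split, and the
# meta-learner is fit on their probability outputs for the held-out split.
idx = np.arange(n_samples)
train_idx, blend_idx = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)

sub_models, blend_probs = [], []
for X in (X_audio, X_visual):
    clf = SVC(kernel="rbf", probability=True, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    sub_models.append(clf)
    blend_probs.append(clf.predict_proba(X[blend_idx]))

# Meta-learner fuses the concatenated class probabilities (the "blending" step).
meta = LogisticRegression(max_iter=1000)
meta.fit(np.hstack(blend_probs), y[blend_idx])

def predict(x_audio, x_visual):
    """Fuse per-modality sub-model probabilities through the trained meta-learner."""
    probs = [m.predict_proba(x) for m, x in zip(sub_models, (x_audio, x_visual))]
    return meta.predict(np.hstack(probs))

print(predict(X_audio[:5], X_visual[:5]))

In the paper itself, four sub-models (SVM on Interspeech 2010 features, SVM on local LBP features, and multi-task CNNs on audio and visual deep features) would take the place of the two placeholder SVMs above; the blending procedure is otherwise the same.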
Pages: 42-51
Number of pages: 9
References (49 in total)
  • [1] Salovey P., Mayer J.D., Emotional intelligence, Imagin. Cognit. Personal., 9, 3, pp. 185-211, (1990)
  • [2] Zeng Z., Pantic M., Roisman G.I., Et al., A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Trans. Pattern Anal. Mach. Intell., 31, 1, pp. 39-58, (2009)
  • [3] Liu Z.T., Wu M., Cao W.H., Et al., Speech emotion recognition based on feature selection and extreme learning machine decision tree, Neurocomputing, 273, pp. 271-280, (2018)
  • [4] Ricciardi L., Viscocomandini F., Erro R., Et al., Facial emotion recognition and expression in Parkinson's disease: an emotional mirror mechanism, PLoS ONE, 12, 1, (2017)
  • [5] Liu Z.T., Xie Q., Wu M., Et al., Speech emotion recognition based on an improved brain emotion learning model, Neurocomputing, 309, pp. 145-156, (2018)
  • [6] Noroozi F., Sapinski T., Kaminska D., Et al., Vocal-based emotion recognition using random forests and decision tree, Int. J. Speech Technol., 20, 2, pp. 239-246, (2017)
  • [7] Noroozi F., Marjanovic M., Njegus A., Et al., Audio-visual emotion recognition in video clips, IEEE Trans. Affect. Comput., 10, 1, pp. 60-75, (2019)
  • [8] Zhang S.Q., Li L.M., Zhao Z.J., Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech, (2012)
  • [9] Wang Y., Guan L., Venetsanopoulos A.N., Kernel cross-modal factor analysis for information fusion with application to bimodal emotion recognition, IEEE Trans. Multimedia, 14, 3, pp. 597-607, (2012)
  • [10] Zhang S.Q., Zhang S.L., Huang T.J., Et al., Learning affective features with a hybrid deep model for audio-visual emotion recognition, IEEE Trans. Circuits Syst. Video Technol., 28, 10, pp. 3030-3043, (2018)