Multi-cue fusion for emotion recognition in the wild

Cited by: 67
Authors
Yan, Jingwei [1 ]
Zheng, Wenming [1 ]
Cui, Zhen [2 ]
Tang, Chuangao [1 ]
Zhang, Tong [3 ]
Zong, Yuan [1 ]
Affiliations
[1] Southeast Univ, Sch Biol Sci & Med Engn, Key Lab Child Dev & Learning Sci, Minist Educ, Nanjing 210096, Jiangsu, Peoples R China
[2] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Informat Sci & Engn, Nanjing 210096, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Emotion recognition; Convolutional neural network (CNN); Facial landmark action; Multi-cue fusion;
DOI
10.1016/j.neucom.2018.03.068
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotion recognition has become a hot research topic in recent years due to the large demand for this technology in many practical situations. One challenging task in this area is recognizing emotion types in video clips collected in the wild. To solve this problem, we propose a multi-cue fusion emotion recognition (MCFER) framework that models human emotions from three complementary cues, i.e., facial texture, facial landmark action, and audio signal, and then fuses them together. To capture the dynamic change of facial texture, we employ a cascaded convolutional neural network (CNN) and bidirectional recurrent neural network (BRNN) architecture, in which the facial image from each frame is first fed into the CNN to extract a high-level texture feature, and the resulting feature sequence is then passed to the BRNN to learn the changes within it. Facial landmark action explicitly models the movement of facial muscles; SVM and CNN classifiers are deployed to explore the emotion-related patterns in it. The audio signal is also modeled with a CNN by extracting low-level acoustic features from segmented clips and stacking them into an image-like matrix. We fuse these models at both the feature level and the decision level to further boost overall performance. Experimental results on two challenging databases demonstrate the effectiveness and superiority of the proposed MCFER framework. (C) 2018 Elsevier B.V. All rights reserved.
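The decision-level fusion mentioned in the abstract can be sketched as a weighted average of each cue model's class-probability vector followed by an argmax. This is a minimal illustration only: the per-cue logits, the seven emotion classes, and the uniform weights below are assumptions for demonstration, not values from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_decisions(cue_logits, weights=None):
    """Decision-level fusion: weighted average of per-cue class
    probabilities, then argmax over emotion classes."""
    n_cues = len(cue_logits)
    n_classes = len(cue_logits[0])
    if weights is None:
        weights = [1.0 / n_cues] * n_cues  # uniform weights (assumption)
    probs = [softmax(logits) for logits in cue_logits]
    fused = [sum(w * p[c] for w, p in zip(weights, probs))
             for c in range(n_classes)]
    return fused.index(max(fused)), fused

# Hypothetical logits from three cue models over 7 emotion classes.
texture  = [2.1, 0.3, -1.0, 0.5,  0.0, -0.2,  0.4]
landmark = [1.2, 0.8, -0.5, 1.9,  0.1,  0.0, -0.3]
audio    = [0.5, 0.2,  0.1, 2.4, -0.6,  0.3,  0.0]
label, fused = fuse_decisions([texture, landmark, audio])
```

Here the texture model alone favors class 0, but the landmark and audio models both favor class 3, so the fused decision picks class 3 — the point of combining complementary cues.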
Pages: 27-35 (9 pages)