Video Emotion Recognition in the Wild Based on Fusion of Multimodal Features

Cited by: 12
Authors
Chen, Shizhe [1 ]
Li, Xinrui [1 ]
Jin, Qin [1 ]
Zhang, Shilei [2 ]
Qin, Yong [2 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] IBM Res Lab, Beijing, Peoples R China
Source
ICMI'16: PROCEEDINGS OF THE 18TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2016
Keywords
Video Emotion Recognition; Multimodal Features; CNN; Late Fusion;
DOI
10.1145/2993148.2997629
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we present our methods for the Audio-Video Based Emotion Recognition subtask of the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of seven basic emotions for the characters in video clips extracted from movies or TV shows. In our approach, we explore multimodal features from the audio, facial image, and video motion modalities. The audio features comprise statistical acoustic features, MFCC Bag-of-Audio-Words, and MFCC Fisher Vectors. For image-related features, we extract hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). Improved Dense Trajectories serve as the motion-related features. We train SVM, Random Forest, and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher Vector is the best acoustic feature and the facial CNN feature is the most discriminative feature for emotion recognition. We use late fusion to combine the different modality features and achieve 50.76% accuracy on the test set, significantly outperforming the baseline test accuracy of 40.47%.
Pages: 494-500
Number of pages: 7
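The abstract describes combining per-modality classifiers by late fusion. The following is a minimal, hypothetical sketch of weighted late fusion of classifier posteriors using scikit-learn; the random placeholder features, the choice of one classifier per modality, and the fusion weights are illustrative assumptions and are not taken from the paper.

# Minimal sketch of weighted late fusion over per-modality classifiers.
# All features, classifier assignments, and weights below are hypothetical.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_val, n_classes = 200, 50, 7  # seven basic emotion categories

# Placeholder per-modality features (real ones would be MFCC Fisher Vectors,
# facial CNN features, improved Dense Trajectory encodings, etc.).
modalities = {
    "audio":  (rng.normal(size=(n_train, 128)), rng.normal(size=(n_val, 128))),
    "face":   (rng.normal(size=(n_train, 256)), rng.normal(size=(n_val, 256))),
    "motion": (rng.normal(size=(n_train, 64)),  rng.normal(size=(n_val, 64))),
}
y_train = rng.integers(0, n_classes, size=n_train)
y_val = rng.integers(0, n_classes, size=n_val)

# One classifier per modality for brevity; the paper trains SVM, Random Forest,
# and Logistic Regression for each kind of feature.
classifiers = {
    "audio":  SVC(kernel="linear", probability=True),
    "face":   LogisticRegression(max_iter=1000),
    "motion": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Hypothetical fusion weights; in practice they would be tuned on a validation set.
weights = {"audio": 0.3, "face": 0.5, "motion": 0.2}

# Late fusion: weighted sum of per-modality class posteriors, then argmax.
fused = np.zeros((n_val, n_classes))
for name, (X_tr, X_va) in modalities.items():
    clf = classifiers[name].fit(X_tr, y_train)
    fused += weights[name] * clf.predict_proba(X_va)

pred = fused.argmax(axis=1)
print("fused validation accuracy:", (pred == y_val).mean())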