Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Source
ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2018
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN);
DOI
10.1145/3242969.3264992
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline Code
0812;
Abstract
The core difficulty of emotion recognition in the wild (EmotiW) is training a robust model that can handle diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To achieve better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by exploiting two mutually complementary sources: the facial images and the audio. The framework consists of two parts: the facial image model and the audio model. In the facial image model, three different spatio-temporal network architectures are employed to extract features that discriminate between emotions in facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNN), namely VGG-Face and ResNet-50, each fed with the frames extracted from a video. The features of all frames are then input sequentially to a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance textures in the video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images generated by preprocessing the speech are likewise modeled in a VGG-BLSTM framework to characterize affective fluctuations more effectively. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to exploit their complementarity and boost recognition performance. Extensive experiments show that the overall accuracy of the proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the 2017 champion team.
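To make the pipeline concrete, here is a minimal PyTorch sketch (not the authors' released code) of the CNN-BLSTM branch and the score-level fusion described above. The stand-in backbone, tensor shapes, seven emotion classes, and the uniform weighted-average fusion rule are all illustrative assumptions; in the paper the frame features come from pre-trained VGG-Face/ResNet-50 networks, and the fused score matrices also include a 3D CNN branch and an audio VGG-BLSTM branch.

# Minimal sketch of the CNN-BLSTM branch and score-level fusion described
# in the abstract. NOT the authors' code: the backbone, shapes, class count,
# and the averaging fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class CnnBlstm(nn.Module):
    """Frame-level CNN features -> bidirectional LSTM -> emotion scores."""
    def __init__(self, feat_dim=512, hidden=128, num_classes=7):
        super().__init__()
        # Stand-in for a pre-trained backbone such as VGG-Face / ResNet-50;
        # in practice the backbone would be frozen and only the head trained.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                       # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                # (B, T, feat_dim)
        seq, _ = self.blstm(feats)                  # (B, T, 2*hidden)
        return self.head(seq.mean(dim=1))           # (B, num_classes)

def fuse_scores(score_mats, weights=None):
    """Weighted average of per-model score matrices, each (B, num_classes)."""
    if weights is None:
        weights = [1.0 / len(score_mats)] * len(score_mats)
    stacked = torch.stack(score_mats)               # (M, B, num_classes)
    w = torch.tensor(weights).view(-1, 1, 1)
    return (w * stacked).sum(dim=0)                 # (B, num_classes)

if __name__ == "__main__":
    model = CnnBlstm()
    clips = torch.randn(2, 16, 3, 64, 64)           # 2 clips, 16 frames each
    s1 = model(clips).softmax(dim=-1)               # facial CNN-BLSTM scores
    s2 = torch.randn(2, 7).softmax(dim=-1)          # e.g. 3D CNN scores
    s3 = torch.randn(2, 7).softmax(dim=-1)          # e.g. audio VGG-BLSTM
    fused = fuse_scores([s1, s2, s3])
    print(fused.argmax(dim=-1))                     # predicted emotion labels

Two of the score matrices above are simulated with random tensors purely to exercise the fusion step; in practice each would be the softmax output of a trained branch. Uniform averaging is the simplest instance of score-level fusion; the paper's fusion strategy over score matrices could equally use per-model weights tuned on validation data.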
Pages: 646-652
Page count: 7