Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited by: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Source
ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2018
Funding
National Natural Science Foundation of China
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); CLASSIFICATION;
DOI
10.1145/3242969.3264992
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
The difficulty of emotion recognition in the wild (EmotiW) lies in training a robust model that can handle diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information in the spatial and temporal dimensions more accurately through two mutually complementary sources: the facial image and the audio. The framework consists of two parts: the facial image model and the audio model. In the facial image model, three different spatio-temporal neural network architectures are employed to extract discriminative features for different emotions from facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNNs), VGG-Face and ResNet-50, each fed with the frames extracted from a video. The features of all frames are then input sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance textures in a video. In addition to the CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images generated by preprocessing the speech audio are likewise modeled in a VGG-BLSTM framework to characterize the affective fluctuation more effectively. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to boost emotion recognition performance complementarily.
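The audio branch described above first converts each speech signal into a spectrogram image before the VGG-BLSTM stage. A minimal sketch of that preprocessing step, using a Hann-windowed short-time Fourier transform; the window length, hop size, and sampling rate below are illustrative assumptions, not parameters stated in the abstract:

```python
import numpy as np

def log_spectrogram(signal, win_len=400, hop=160):
    """Log-magnitude spectrogram via a Hann-windowed STFT.

    Returns an array of shape (n_frames, win_len // 2 + 1) that can be
    rendered as an image and fed to a CNN such as VGG.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Slice the signal into overlapping windowed frames
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # Real-input FFT per frame, then log compression for dynamic range
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + 1e-8)

# 1 second of a synthetic 440 Hz tone at an assumed 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

With these assumed settings (25 ms windows, 10 ms hop at 16 kHz), one second of audio yields 98 frames of 201 frequency bins each.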
Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the 2017 champion team.
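The final fusion step combines the score matrices of the individual spatio-temporal networks. A minimal sketch of one common realization, weighted score-level averaging; the abstract does not specify the exact fusion rule or weights, so both are assumptions here:

```python
import numpy as np

def fuse_scores(score_matrices, weights=None):
    """Weighted score-level fusion of per-model class-score matrices.

    score_matrices: list of (n_samples, n_classes) arrays, one per model
    weights: optional per-model weights (hypothetical; normalized to sum to 1)
    Returns (predicted labels, fused score matrix).
    """
    scores = np.stack(score_matrices)          # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(score_matrices))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, scores, axes=1)  # (n_samples, n_classes)
    return fused.argmax(axis=1), fused

# Hypothetical score matrices from four models over 7 emotion classes
rng = np.random.default_rng(0)
mats = [rng.random((5, 7)) for _ in range(4)]
preds, fused = fuse_scores(mats, weights=[0.3, 0.3, 0.2, 0.2])
```

Giving the stronger branches (e.g. the face models) larger weights is one way such a strategy can exploit the complementarity between the facial and audio sources.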
Pages: 646-652 (7 pages)