Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Source
ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2018
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN);
DOI
10.1145/3242969.3264992
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology];
Discipline Code
0812;
Abstract
The core difficulty of emotion recognition in the wild (EmotiW) is training a robust model that can handle diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To achieve better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information more accurately in the spatial and temporal dimensions by exploiting two mutually complementary sources: the facial images and the audio. The framework consists of two parts: the facial image model and the audio model. In the facial image model, three different spatio-temporal network architectures are employed to extract features that discriminate between emotions in facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNN), namely VGG-Face and ResNet-50, each fed with the frames extracted from a video. The features of all frames are then input sequentially to a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance textures in the video. In addition to this CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images generated by preprocessing the speech are likewise modeled in a VGG-BLSTM framework to characterize affective fluctuations more effectively. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to exploit their complementarity and boost recognition performance. Extensive experiments show that the overall accuracy of the proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the 2017 champion team.
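To make the pipeline concrete, here is a minimal PyTorch sketch (not the authors' released code) of the CNN-BLSTM branch and the score-level fusion described above. The stand-in backbone, tensor shapes, seven emotion classes, and the uniform weighted-average fusion rule are all illustrative assumptions; in the paper the frame features come from pre-trained VGG-Face/ResNet-50 networks, and the fused score matrices also include a 3D CNN branch and an audio VGG-BLSTM branch.

# Minimal sketch of the CNN-BLSTM branch and score-level fusion described
# in the abstract. NOT the authors' code: the backbone, shapes, class count,
# and the averaging fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class CnnBlstm(nn.Module):
    """Frame-level CNN features -> bidirectional LSTM -> emotion scores."""
    def __init__(self, feat_dim=512, hidden=128, num_classes=7):
        super().__init__()
        # Stand-in for a pre-trained backbone such as VGG-Face / ResNet-50;
        # in practice the backbone would be frozen and only the head trained.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                       # clips: (B, T, C, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))  # (B*T, feat_dim)
        feats = feats.view(b, t, -1)                # (B, T, feat_dim)
        seq, _ = self.blstm(feats)                  # (B, T, 2*hidden)
        return self.head(seq.mean(dim=1))           # (B, num_classes)

def fuse_scores(score_mats, weights=None):
    """Weighted average of per-model score matrices, each (B, num_classes)."""
    if weights is None:
        weights = [1.0 / len(score_mats)] * len(score_mats)
    stacked = torch.stack(score_mats)               # (M, B, num_classes)
    w = torch.tensor(weights).view(-1, 1, 1)
    return (w * stacked).sum(dim=0)                 # (B, num_classes)

if __name__ == "__main__":
    model = CnnBlstm()
    clips = torch.randn(2, 16, 3, 64, 64)           # 2 clips, 16 frames each
    s1 = model(clips).softmax(dim=-1)               # facial CNN-BLSTM scores
    s2 = torch.randn(2, 7).softmax(dim=-1)          # e.g. 3D CNN scores
    s3 = torch.randn(2, 7).softmax(dim=-1)          # e.g. audio VGG-BLSTM
    fused = fuse_scores([s1, s2, s3])
    print(fused.argmax(dim=-1))                     # predicted emotion labels

Two of the score matrices above are simulated with random tensors purely to exercise the fusion step; in practice each would be the softmax output of a trained branch. Uniform averaging is the simplest instance of score-level fusion; the paper's fusion strategy over score matrices could equally use per-model weights tuned on validation data.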
Pages: 646-652
Page count: 7