Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild

Cited by: 50
Authors
Lu, Cheng [1 ]
Zheng, Wenming [2 ]
Li, Chaolong [3 ]
Tang, Chuangao [3 ]
Liu, Suyuan [3 ]
Yan, Simeng [3 ]
Zong, Yuan [3 ]
Affiliations
[1] Southeast Univ, Sch Informat Sci & Engn, Nanjing, Jiangsu, Peoples R China
[2] Southeast Univ, Sch Biol Sci & Med Engn, Minist Educ, Key Lab Child Dev & Learning Sci, Nanjing, Jiangsu, Peoples R China
[3] Southeast Univ, Sch Biol Sci & Med Engn, Nanjing, Jiangsu, Peoples R China
Source
ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION | 2018
Funding
National Natural Science Foundation of China
Keywords
Emotion Recognition; Spatio-Temporal Information; Convolutional Neural Networks (CNN); Long Short-Term Memory (LSTM); 3D Convolutional Neural Networks (3D CNN); CLASSIFICATION;
DOI
10.1145/3242969.3264992
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
The difficulty of emotion recognition in the wild (EmotiW) lies in training a robust model that can handle diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips annotated with several emotion labels, and the task is to determine which label each video belongs to. To improve emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which depicts emotional information in the spatial and temporal dimensions more accurately through two mutually complementary sources: the facial image and the audio. The framework consists of two parts: the facial image model and the audio model. In the facial image model, three different spatio-temporal neural network architectures are employed to extract discriminative features for different emotions from facial expression images. First, high-level spatial features are obtained by pre-trained convolutional neural networks (CNNs), VGG-Face and ResNet-50, each fed with the frames extracted from a video. The features of all frames are then input sequentially into a Bi-directional Long Short-Term Memory (BLSTM) network to capture the dynamic variations of facial appearance textures in a video. In addition to the CNN-RNN structure, another spatio-temporal network, a deep 3-Dimensional Convolutional Neural Network (3D CNN) that extends the 2D convolution kernel to 3D, is applied to capture the evolving emotional information encoded in multiple adjacent frames. In the audio model, spectrogram images generated by preprocessing the speech audio are likewise modeled in a VGG-BLSTM framework to characterize the affective fluctuation more effectively. Finally, a fusion strategy over the score matrices produced by the different spatio-temporal networks is proposed to boost emotion recognition performance complementarily.
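The audio branch described above first converts each speech signal into a spectrogram image before the VGG-BLSTM stage. A minimal sketch of that preprocessing step, using a Hann-windowed short-time Fourier transform; the window length, hop size, and sampling rate below are illustrative assumptions, not parameters stated in the abstract:

```python
import numpy as np

def log_spectrogram(signal, win_len=400, hop=160):
    """Log-magnitude spectrogram via a Hann-windowed STFT.

    Returns an array of shape (n_frames, win_len // 2 + 1) that can be
    rendered as an image and fed to a CNN such as VGG.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Slice the signal into overlapping windowed frames
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # Real-input FFT per frame, then log compression for dynamic range
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spec + 1e-8)

# 1 second of a synthetic 440 Hz tone at an assumed 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

With these assumed settings (25 ms windows, 10 ms hop at 16 kHz), one second of audio yields 98 frames of 201 frequency bins each.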
Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, a large improvement over the baseline that also outperforms the result of the 2017 champion team.
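The final fusion step combines the score matrices of the individual spatio-temporal networks. A minimal sketch of one common realization, weighted score-level averaging; the abstract does not specify the exact fusion rule or weights, so both are assumptions here:

```python
import numpy as np

def fuse_scores(score_matrices, weights=None):
    """Weighted score-level fusion of per-model class-score matrices.

    score_matrices: list of (n_samples, n_classes) arrays, one per model
    weights: optional per-model weights (hypothetical; normalized to sum to 1)
    Returns (predicted labels, fused score matrix).
    """
    scores = np.stack(score_matrices)          # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(score_matrices))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    fused = np.tensordot(weights, scores, axes=1)  # (n_samples, n_classes)
    return fused.argmax(axis=1), fused

# Hypothetical score matrices from four models over 7 emotion classes
rng = np.random.default_rng(0)
mats = [rng.random((5, 7)) for _ in range(4)]
preds, fused = fuse_scores(mats, weights=[0.3, 0.3, 0.2, 0.2])
```

Giving the stronger branches (e.g. the face models) larger weights is one way such a strategy can exploit the complementarity between the facial and audio sources.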
Pages: 646-652 (7 pages)