Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks

Cited by: 5

Authors:

Teixeira, Thomas [1]

Granger, Eric [1]

Lameiras Koerich, Alessandro [1]

Affiliations:

[1] Univ Quebec, Ecole Technol Super, 1100 Rue Notre Dame Ouest, Montreal, PQ H3C 1K3, Canada

Source:

APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 24

Funding:

Natural Sciences and Engineering Research Council of Canada;

Keywords:

facial expression recognition; deep learning; convolutional recurrent neural networks; inflated 3D CNNs; dimensional emotion representation; long short-term memory; FACIAL EXPRESSIONS; DEEP; FEATURES; IMAGE; FACE;

DOI:

10.3390/app112411738

Chinese Library Classification number:

O6 [Chemistry];

Discipline classification code:

0703;

Abstract:

Facial expressions are one of the most powerful ways to depict specific patterns in human behavior and describe the human emotional state. However, despite the impressive advances of affective computing over the last decade, automatic video-based systems for facial expression recognition still cannot correctly handle variations in facial expression among individuals as well as cross-cultural and demographic aspects. Nevertheless, recognizing facial expressions is a difficult task, even for humans. This paper investigates the suitability of state-of-the-art deep learning architectures based on convolutional neural networks (CNNs) to deal with long video sequences captured in the wild for continuous emotion recognition. For such an aim, several 2D CNN models that were designed to model spatial information are extended to allow spatiotemporal representation learning from videos, considering a complex and multi-dimensional emotion space, where continuous values of valence and arousal must be predicted. We have developed and evaluated convolutional recurrent neural networks, combining 2D CNNs and long short-term memory units, and inflated 3D CNN models, which are built by inflating the weights of a pre-trained 2D CNN model during fine-tuning, using application-specific videos. Experimental results on the challenging SEWA-DB dataset have shown that these architectures can effectively be fine-tuned to encode spatiotemporal information from successive raw pixel images and achieve state-of-the-art results on such a dataset.
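
The abstract mentions building 3D CNNs by inflating the weights of a pre-trained 2D CNN but does not spell out the inflation rule. The sketch below assumes one commonly used scheme: each 2D kernel is repeated along a new time axis and rescaled by 1/T, so that a video made of identical frames gives the same filter response as the original 2D kernel gives on a single frame. The function names and the nested-list weight layout are hypothetical, chosen for illustration only, and may differ from what the paper actually uses.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Inflate one 2D kernel (kh x kw nested lists) into a 3D kernel
    of shape (time_depth x kh x kw).

    Assumed rule (not confirmed by the abstract): copy the 2D kernel
    time_depth times and divide every weight by time_depth.
    """
    if time_depth < 1:
        raise ValueError("time_depth must be >= 1")
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def inflate_conv_weights(weights_2d, time_depth):
    """Inflate a full convolution weight bank.

    Assumed layout: [out_channels][in_channels][kh][kw]
    Result layout:  [out_channels][in_channels][time_depth][kh][kw]
    """
    return [
        [inflate_kernel(kernel, time_depth) for kernel in per_out_channel]
        for per_out_channel in weights_2d
    ]


# Toy 2x2 kernel standing in for a pre-trained 2D filter.
kernel = [[1.0, 2.0], [3.0, 4.0]]
inflated = inflate_kernel(kernel, 4)

# Summing the inflated kernel over time recovers the 2D kernel, which is
# why a static clip yields the same response as the 2D filter.
recovered = [
    [sum(inflated[t][i][j] for t in range(4)) for j in range(2)]
    for i in range(2)
]
```

In a real framework the same operation would be applied to each convolution layer's weight tensor before fine-tuning on video; the temporal depth per layer is a design choice that the abstract does not state.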


Pages: 21
