Enhanced spatio-temporal 3D CNN for facial expression classification in videos

Cited by: 2
Authors
Khanna, Deepanshu [1 ]
Jindal, Neeru [1 ]
Rana, Prashant Singh [2 ]
Singh, Harpreet [2 ]
Affiliations
[1] Thapar Inst Engn & Technol, Elect & Commun Engn Dept, Patiala, Punjab, India
[2] Thapar Inst Engn & Technol, Comp Sci Engn Dept, Patiala, Punjab, India
Keywords
Face expression recognition; Video pre-processing; Hybrid deep network
DOI
10.1007/s11042-023-16066-6
CLC classification number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
This article proposes a hybrid network model for a video-based human facial expression recognition (FER) system, built as an end-to-end 3D deep convolutional neural network. The proposed network combines two widely used 3-dimensional Convolutional Neural Network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Various methodologies currently exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNNs, and feature-extraction algorithms such as PCA and Histogram of Oriented Gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose 3D CNNs over the other methods because, unlike 2D CNNs, they preserve the temporal information of videos; moreover, they are not labor-intensive like handcrafted feature-extraction methods. The proposed system relies on temporal averaging of information from the frame sequences of a video. The databases are pre-processed to remove unwanted backgrounds before training the 3D deep CNN from scratch. First, feature vectors are extracted from video frame sequences using the 3D ResNet model. These feature vectors are then fed to the 3D DenseNet model's blocks, which classify the predicted emotion. The model is evaluated on three benchmark databases, RAVDESS, CK+, and BAUM-1s, achieving 91.69%, 98.61%, and 73.73% accuracy respectively, and outperforms various existing methods. We show that the proposed architecture works well even for classes with little training data, where many existing 3D CNN networks fail.
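The abstract states that the system relies on temporal averaging of information across a video's frame sequence before classifying the emotion. A minimal sketch of that idea, assuming hypothetical per-frame score vectors (the paper's actual 3D CNN operates on whole clips; shapes and labels here are illustrative only):

```python
import numpy as np

def temporal_average(frame_scores):
    """Average per-frame score vectors over the time axis.

    frame_scores: array of shape (T, C) -- one C-class score vector per
    frame (hypothetical layout, for illustration of the averaging step).
    """
    return np.asarray(frame_scores).mean(axis=0)

def predict_emotion(frame_scores, labels):
    """Pick the emotion whose temporally averaged score is highest."""
    avg = temporal_average(frame_scores)
    return labels[int(np.argmax(avg))]

# Toy example: 4 frames, 3 emotion classes.
scores = np.array([
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
])
labels = ["angry", "happy", "sad"]
print(predict_emotion(scores, labels))  # -> happy
```

Averaging over time smooths out single-frame misclassifications, which is one reason video-level pooling tends to be more robust than per-frame prediction.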
Pages: 9911 - 9928
Page count: 18