Enhanced spatio-temporal 3D CNN for facial expression classification in videos

Cited by: 2
Authors
Khanna, Deepanshu [1 ]
Jindal, Neeru [1 ]
Rana, Prashant Singh [2 ]
Singh, Harpreet [2 ]
Affiliations
[1] Thapar Inst Engn & Technol, Elect & Commun Engn Dept, Patiala, Punjab, India
[2] Thapar Inst Engn & Technol, Comp Sci Engn Dept, Patiala, Punjab, India
Keywords
Face expression recognition; Video pre-processing; Hybrid deep network; RECOGNITION; FACE;
DOI
10.1007/s11042-023-16066-6
CLC Number
TP [Automation & Computer Technology]
Subject Classification Code
0812
Abstract
This article proposes a hybrid network model for a video-based human facial expression recognition (FER) system, built as an end-to-end 3D deep convolutional neural network. The proposed network combines two commonly used deep 3-dimensional Convolutional Neural Network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Various methodologies currently exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNN, and feature-extraction algorithms such as PCA and Histogram of Oriented Gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose 3D CNN over the other approaches because, unlike 2D CNN, it preserves the temporal information of the videos; moreover, it is not labor-intensive in the way that handcrafted feature-extraction methods are. The proposed system relies on the temporal averaging of information from frame sequences of the video. The databases are pre-processed to remove unwanted backgrounds before training the 3D deep CNN from scratch. Initially, feature vectors are extracted from video frame sequences using the 3D ResNet model. These feature vectors are fed to the 3D DenseNet model's blocks, which then classify the predicted emotion. The model is evaluated on three benchmark databases, RAVDESS, CK+, and BAUM-1s, achieving 91.69%, 98.61%, and 73.73% accuracy, respectively, and outperforming various existing methods. We show that the proposed architecture works well even for classes with limited training data, where many existing 3D CNN networks fail.
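The abstract describes the pipeline only at a high level: a 3D-ResNet-style feature extractor feeding 3D-DenseNet-style blocks, with temporal averaging before classification. As a rough illustration of how such an end-to-end hybrid could be wired, here is a minimal PyTorch sketch; the layer widths, block counts, clip size, and eight-class output head are illustrative assumptions, not the published ResNet-50/DenseNet-121 configuration.

```python
# Minimal sketch (not the authors' code) of the hybrid 3D CNN idea: a
# 3D-ResNet-style stem extracts spatio-temporal features from a clip, and a
# 3D-DenseNet-style block refines them before classification. All sizes and
# the 8-class output are illustrative assumptions.
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """Basic 3D residual block (ResNet-style) over (T, H, W) volumes."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut


class Dense3DBlock(nn.Module):
    """3D dense block (DenseNet-style): each layer sees all earlier feature maps."""
    def __init__(self, in_channels, growth_rate=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm3d(ch),
                nn.ReLU(inplace=True),
                nn.Conv3d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            ch += growth_rate
        self.out_channels = ch

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # channel-wise concatenation
        return x


class HybridResNetDenseNet3D(nn.Module):
    """End-to-end sketch: 3D ResNet features -> 3D DenseNet block -> classifier."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1, bias=False),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            Residual3DBlock(32),
            Residual3DBlock(32),
        )
        self.dense = Dense3DBlock(32)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),   # averages over frames and spatial positions
            nn.Flatten(),
            nn.Linear(self.dense.out_channels, num_classes),
        )

    def forward(self, clip):           # clip: (batch, 3, frames, H, W)
        return self.head(self.dense(self.stem(clip)))


if __name__ == "__main__":
    model = HybridResNetDenseNet3D()
    dummy_clip = torch.randn(2, 3, 16, 112, 112)  # two 16-frame RGB clips
    print(model(dummy_clip).shape)                # -> torch.Size([2, 8])
```

In this sketch the adaptive average pooling stands in for the temporal averaging mentioned in the abstract, collapsing the frame dimension before the final linear classifier.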
Pages: 9911-9928
Number of pages: 18