Enhanced spatio-temporal 3D CNN for facial expression classification in videos

Cited by: 2
Authors
Khanna, Deepanshu [1 ]
Jindal, Neeru [1 ]
Rana, Prashant Singh [2 ]
Singh, Harpreet [2 ]
Affiliations
[1] Thapar Inst Engn & Technol, Elect & Commun Engn Dept, Patiala, Punjab, India
[2] Thapar Inst Engn & Technol, Comp Sci Engn Dept, Patiala, Punjab, India
Keywords
Face expression recognition; Video pre-processing; Hybrid deep network
DOI
10.1007/s11042-023-16066-6
CLC classification number
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
This article proposes a hybrid network model for a video-based human facial expression recognition (FER) system, built as an end-to-end 3D deep convolutional neural network. The proposed network combines two widely used 3-dimensional Convolutional Neural Network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Various methodologies currently exist for FER, such as 2-dimensional Convolutional Neural Networks (2D CNN), 2D CNN-Recurrent Neural Networks, 3D CNNs, and feature-extraction algorithms such as PCA and Histogram of Oriented Gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose 3D CNNs over the other methods because, unlike 2D CNNs, they preserve the temporal information of videos; moreover, they are not labor-intensive like handcrafted feature-extraction methods. The proposed system relies on temporal averaging of information from the frame sequences of a video. The databases are pre-processed to remove unwanted backgrounds before training the 3D deep CNN from scratch. First, feature vectors are extracted from video frame sequences using the 3D ResNet model. These feature vectors are then fed to the 3D DenseNet model's blocks, which classify the predicted emotion. The model is evaluated on three benchmark databases, RAVDESS, CK+, and BAUM-1s, achieving 91.69%, 98.61%, and 73.73% accuracy respectively, and outperforms various existing methods. We show that the proposed architecture works well even for classes with little training data, where many existing 3D CNN networks fail.
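The abstract states that the system relies on temporal averaging of information across a video's frame sequence before classifying the emotion. A minimal sketch of that idea, assuming hypothetical per-frame score vectors (the paper's actual 3D CNN operates on whole clips; shapes and labels here are illustrative only):

```python
import numpy as np

def temporal_average(frame_scores):
    """Average per-frame score vectors over the time axis.

    frame_scores: array of shape (T, C) -- one C-class score vector per
    frame (hypothetical layout, for illustration of the averaging step).
    """
    return np.asarray(frame_scores).mean(axis=0)

def predict_emotion(frame_scores, labels):
    """Pick the emotion whose temporally averaged score is highest."""
    avg = temporal_average(frame_scores)
    return labels[int(np.argmax(avg))]

# Toy example: 4 frames, 3 emotion classes.
scores = np.array([
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
])
labels = ["angry", "happy", "sad"]
print(predict_emotion(scores, labels))  # -> happy
```

Averaging over time smooths out single-frame misclassifications, which is one reason video-level pooling tends to be more robust than per-frame prediction.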
Pages: 9911 - 9928
Page count: 18