A spatio-temporal integrated model based on local and global features for video expression recognition

Cited by: 0
Authors
Min Hu
Peng Ge
Xiaohua Wang
Hui Lin
Fuji Ren
Affiliations
[1] Hefei University of Technology,Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education
[2] Hefei University of Technology,School of Computer and Information, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine
[3] Hefei University of Technology,School of Electronic Science and Applied Physics
[4] University of Tokushima,Graduate School of Advanced Technology and Science
Source
The Visual Computer | 2022 / Volume 38
Keywords
Video expression recognition; Local and global features; Attention mechanism; Feature recalibration; Network integration
DOI
Not available
Abstract
Facial expressions are represented largely by the dynamic variations of important facial parts, i.e., the eyebrows, eyes, nose, and mouth. The features of these parts are regarded as local features. However, global facial information is also useful for recognition, as it is a necessary complement to local features. In this paper, a spatio-temporal integrated model that jointly learns local and global features is proposed for video expression recognition. First, to capture the actions of key facial units, a spatio-temporal attention part-gradient-based hierarchical bidirectional recurrent neural network (spatio-temporal attention PGHRNN) is constructed, which captures the dynamic variations of gradients around facial landmark points. In addition, a new spatial attention mechanism is introduced to adaptively recalibrate the features of the various facial parts. Second, to complement the local features extracted by the spatio-temporal attention PGHRNN, a 50-layer squeeze-and-excitation residual network with a long short-term memory network (SE-ResNet-50-LSTM) is used as a global feature extractor and classifier. Finally, to integrate the local and global features and improve recognition performance, a joint adaptive fine-tuning method (JAFTM) is proposed that combines the two networks and adaptively adjusts their weights. Extensive experiments demonstrate that the proposed model achieves a recognition accuracy of 98.95% on CK+ for 7-class facial expressions and 85.40% on the MMI database, outperforming other state-of-the-art methods.
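The abstract names two concrete mechanisms: a spatial attention step that recalibrates per-part features, and an adaptive weighting that fuses the local (PGHRNN) and global (SE-ResNet-50-LSTM) branches. The sketch below is a minimal, hypothetical PyTorch rendering of both ideas; the layer shapes, the single-scalar attention score, and the sigmoid-bounded mixing weight are illustrative assumptions, not the paper's exact attention or JAFTM design.

    import torch
    import torch.nn as nn

    class PartAttention(nn.Module):
        """Adaptively recalibrates per-part features with learned attention
        weights (hypothetical design; the paper's exact layer is not given)."""
        def __init__(self, feat_dim: int):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)  # one relevance score per part

        def forward(self, part_feats: torch.Tensor) -> torch.Tensor:
            # part_feats: (batch, parts, feat_dim), e.g. gradient features
            # around landmark groups such as eyebrows, eyes, nose, and mouth
            weights = torch.softmax(self.score(part_feats), dim=1)  # (B, P, 1)
            return part_feats * weights  # emphasize the informative parts

    class AdaptiveFusion(nn.Module):
        """Mixes local- and global-branch logits with one learnable weight,
        loosely in the spirit of the paper's JAFTM (assumed form)."""
        def __init__(self):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(0.0))  # learned in joint fine-tuning

        def forward(self, local_logits, global_logits):
            a = torch.sigmoid(self.alpha)  # keep the mixing weight in (0, 1)
            return a * local_logits + (1.0 - a) * global_logits

    # Usage: fuse 7-class logits from the two (here randomly simulated) branches.
    fusion = AdaptiveFusion()
    local_logits = torch.randn(4, 7)   # from the PGHRNN-style local branch
    global_logits = torch.randn(4, 7)  # from the SE-ResNet-50-LSTM branch
    print(fusion(local_logits, global_logits).shape)  # torch.Size([4, 7])

Bounding the mixing weight with a sigmoid is one simple way to keep the combination a convex blend of the two branches while still letting gradients adjust it during joint fine-tuning.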
Pages: 2617-2634
Number of pages: 17