Dynamic facial expression recognition based on spatial key-points optimized region feature fusion and temporal self-attention

Cited by: 1
Authors
Huang, Zhiwei [1]
Zhu, Yu [1]
Li, Hangyu [1]
Yang, Dawei [2,3]
Affiliations
[1] East China Univ Sci & Technol, Sch Informat Sci & Engn, Shanghai 200237, Peoples R China
[2] Fudan Univ, Zhongshan Hosp, Dept Pulm & Crit Care Med, Shanghai 200032, Peoples R China
[3] Shanghai Engn Res Ctr Internet Things Resp Med, Shanghai 200032, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dynamic facial expression recognition; Spatial feature fusion; Graph convolution network; Self-attention;
DOI
10.1016/j.engappai.2024.108535
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Number
0812;
Abstract
Dynamic facial expression recognition (DFER) is of great significance for empathetic machines and metaverse technology. However, DFER in the wild remains a challenging task, often constrained by complex lighting changes, frequent key-point occlusion, uncertain emotional peaks, and severely imbalanced dataset categories. To tackle these problems, this paper presents a deep neural network model based on spatial key-points optimized region feature fusion and temporal self-attention. The method comprises three parts: a spatial feature extraction module, a temporal feature extraction module, and a region feature fusion module. The intra-frame spatial feature extraction module combines a key-points graph convolution network (GCN) branch and a convolutional neural network (CNN) branch to obtain global and local feature vectors. A newly proposed region fusion strategy based on the spatial structure of the face is used to obtain the fused spatial feature of each frame. The inter-frame temporal feature extraction module uses multi-head self-attention to capture temporal information across frames. Experimental results show that the method achieves accuracies of 68.73%, 55.00%, 47.80%, and 47.44% on the DFEW, AFEW, FERV39k, and MAFW datasets, respectively. Ablation experiments show that the GCN module, the fusion module, and the temporal module improve accuracy on DFEW by 0.68%, 1.66%, and 3.25%, respectively. The method also achieves competitive results in parameter count and inference speed, demonstrating its effectiveness.
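The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates the described data flow (a per-frame key-points GCN branch plus a CNN branch, feature fusion, then inter-frame multi-head self-attention). All layer sizes, the learnable adjacency, the simple concatenation fusion, and the 7-class output are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class KeypointGCN(nn.Module):
    """One graph-convolution layer over facial key-points (global branch).
    The learnable adjacency stands in for the face key-point graph."""
    def __init__(self, num_kpts, in_dim, out_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_kpts))
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):  # x: (B, K, in_dim) key-point coordinates/features
        return torch.relu(self.adj @ self.fc(x))  # (B, K, out_dim)

class FrameEncoder(nn.Module):
    """Per-frame spatial feature: CNN branch fused with key-point GCN branch.
    The fusion here is plain concatenation, a stand-in for the paper's
    region fusion strategy."""
    def __init__(self, num_kpts=68, kpt_dim=2, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(  # tiny CNN stand-in for the local branch
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.gcn = KeypointGCN(num_kpts, kpt_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, img, kpts):  # img: (B,3,H,W), kpts: (B,K,2)
        g = self.gcn(kpts).mean(dim=1)  # pool over key-points
        c = self.cnn(img)
        return self.fuse(torch.cat([c, g], dim=-1))  # (B, feat_dim)

class DFERSketch(nn.Module):
    """Frames -> per-frame fused features -> temporal multi-head self-attention."""
    def __init__(self, num_classes=7, feat_dim=256, heads=4):
        super().__init__()
        self.frame_enc = FrameEncoder(feat_dim=feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips, kpts):  # clips: (B,T,3,H,W), kpts: (B,T,K,2)
        B, T = clips.shape[:2]
        f = self.frame_enc(clips.flatten(0, 1), kpts.flatten(0, 1)).view(B, T, -1)
        f, _ = self.attn(f, f, f)  # inter-frame self-attention
        return self.head(f.mean(dim=1))  # clip-level expression logits

# Usage with dummy data (2 clips of 8 frames, 68 key-points per frame):
model = DFERSketch()
logits = model(torch.randn(2, 8, 3, 112, 112), torch.randn(2, 8, 68, 2))
print(logits.shape)  # torch.Size([2, 7])

The 7-class output assumes the DFEW label set; the key-point count of 68 follows the common facial landmark convention and may differ from the paper's setting.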
Pages: 12