Facial Expression Recognition Based on Vision Transformer with Hybrid Local Attention

Cited by: 1
Authors
Tian, Yuan [1 ]
Zhu, Jingxuan [1 ]
Yao, Huang [1 ]
Chen, Di [1 ]
Affiliations
[1] Cent China Normal Univ, Fac Artificial Intelligence Educ, Wuhan 430079, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024, Vol. 14, Issue 15
Keywords
facial expression recognition; attention; vision transformer;
DOI
10.3390/app14156471
Chinese Library Classification (CLC): O6 [Chemistry]
Discipline Classification Code: 0703
Abstract
Facial expression recognition has broad application prospects in many settings, yet the complexity and variability of facial expressions make it a challenging research topic. This paper proposes a Vision Transformer expression recognition method based on hybrid local attention (HLA-ViT). The network adopts a dual-stream structure: one stream extracts hybrid local features, while the other extracts global contextual features, and together the two streams form a global-local fusion attention. The hybrid local attention module is proposed to enhance the network's robustness to face occlusion and head pose variations. A convolutional neural network is combined with the hybrid local attention module to obtain feature maps that emphasize locally salient information, while the ViT captures robust features from the global perspective of the visual sequence context. Finally, a decision-level fusion mechanism fuses the expression features with the locally salient information, adding complementary cues that improve recognition performance and robustness against interference factors such as occlusion and head pose changes in natural scenes. Extensive experiments demonstrate that the HLA-ViT network achieves accuracies of 90.45% on RAF-DB, 90.13% on FERPlus, and 65.07% on AffectNet.
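The dual-stream design described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the layer sizes, patch size, the specific channel-plus-spatial form of the hybrid local attention, and the averaging used for decision-level fusion are all assumptions made for the sketch.

```python
# Hypothetical sketch of a dual-stream HLA-ViT-style model (PyTorch).
# All architectural details are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class HybridLocalAttention(nn.Module):
    """Assumed form: channel attention combined with spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid())
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        ca = self.channel_fc(x).view(x.size(0), -1, 1, 1)  # channel weights
        sa = torch.sigmoid(self.spatial_conv(x))           # spatial weights
        return x * ca * sa                                 # locally salient maps

class HLAViTSketch(nn.Module):
    def __init__(self, num_classes=7, dim=64):
        super().__init__()
        # Local stream: small CNN followed by hybrid local attention.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.hla = HybridLocalAttention(dim)
        self.local_head = nn.Linear(dim, num_classes)
        # Global stream: patch embedding plus a ViT-like transformer encoder.
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.vit = nn.TransformerEncoder(enc, num_layers=2)
        self.global_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # Local branch: attended feature maps -> pooled -> logits.
        f = self.hla(self.cnn(x))
        local_logits = self.local_head(f.mean(dim=(2, 3)))
        # Global branch: patch tokens -> transformer -> pooled -> logits.
        t = self.patch(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        global_logits = self.global_head(self.vit(t).mean(dim=1))
        # Decision-level fusion (assumed here as a simple average of logits).
        return (local_logits + global_logits) / 2

model = HLAViTSketch()
out = model(torch.randn(2, 3, 64, 64))   # two RGB face crops, 64x64
print(tuple(out.shape))                  # (batch, num_classes)
```

The key point the sketch illustrates is that the two streams are trained and evaluated as complementary predictors, with fusion happening at the decision (logit) level rather than by concatenating intermediate features.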
Pages: 15