Learning Deep Global Multi-Scale and Local Attention Features for Facial Expression Recognition in the Wild

Cited by: 197
Authors
Zhao, Zengqun [1 ]
Liu, Qingshan [1 ]
Wang, Shanmin [2 ,3 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Engn Res Ctr Digital Forens, Minist Educ, Nanjing 210044, Peoples R China
[2] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210016, Peoples R China
[3] Minist Educ, Engn Res Ctr Digital Forens, Nanjing 210044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Face recognition; Image recognition; Faces; Convolution; Image reconstruction; Geometry; Facial expression recognition; deep convolutional neural networks; multi-scale; local attention; INFORMATION; PATCHES; JOINT; POSE;
DOI
10.1109/TIP.2021.3093397
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Facial expression recognition (FER) in the wild has received broad attention, with occlusion and pose variation being two key challenges. This paper proposes a global multi-scale and local attention network (MA-Net) for FER in the wild. Specifically, the proposed network consists of three main components: a feature pre-extractor, a multi-scale module, and a local attention module. The feature pre-extractor extracts mid-level features; the multi-scale module fuses features with different receptive fields, which reduces the susceptibility of deeper convolutions to occlusion and pose variation; and the local attention module guides the network to focus on local salient features, which mitigates the interference of occlusion and non-frontal poses on FER in the wild. Extensive experiments demonstrate that the proposed MA-Net achieves state-of-the-art results on several in-the-wild FER benchmarks: CAER-S, AffectNet-7, AffectNet-8, RAF-DB, and SFEW, with accuracies of 88.42%, 64.53%, 60.29%, 88.40%, and 59.40%, respectively. The code and training logs are publicly available at https://github.com/zengqunzhao/MA-Net.
Pages: 6544-6556
Page count: 13
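The abstract describes a three-stage pipeline: a feature pre-extractor, a multi-scale fusion module over branches with different receptive fields, and a local attention module that re-weights salient spatial regions. The sketch below illustrates that structure in PyTorch; all module names, layer sizes, and the choice of dilated convolutions for the multi-scale branches are assumptions for illustration, not the authors' MA-Net implementation (which is available at the GitHub link above).

```python
# Hypothetical sketch of the MA-Net pipeline described in the abstract.
# Layer sizes, dilation rates, and module internals are assumptions.
import torch
import torch.nn as nn


class MultiScaleModule(nn.Module):
    """Fuses features from parallel branches with different receptive fields."""

    def __init__(self, channels: int):
        super().__init__()
        # 3x3 convolutions with increasing dilation give each branch a
        # progressively larger receptive field at the same resolution.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)  # 1x1 fusion

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class LocalAttentionModule(nn.Module):
    """Re-weights spatial locations so salient local regions dominate."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, 1),  # per-location attention logit
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.attn(x)  # broadcast the spatial attention mask


class MANetSketch(nn.Module):
    def __init__(self, channels: int = 64, num_classes: int = 7):
        super().__init__()
        # Feature pre-extractor: a small conv stack standing in for the
        # truncated backbone that extracts mid-level features.
        self.pre = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.multi_scale = MultiScaleModule(channels)
        self.local_attn = LocalAttentionModule(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        f = self.pre(x)
        f = self.multi_scale(f)
        f = self.local_attn(f)
        return self.head(f.mean(dim=(2, 3)))  # global average pool
```

With 7 output classes the head matches datasets such as AffectNet-7 or RAF-DB; a 64x64 RGB batch of shape `(N, 3, 64, 64)` yields logits of shape `(N, 7)`.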