Transformer-based Siamese and Triplet Networks for Facial Expression Intensity Estimation

Cited by: 0
Author
Sabri, Motaz [1 ]
Affiliation
[1] Ridge-i, Otemachi Bldg. 438, 1-6-1 Otemachi, Chiyoda-ku, Tokyo 100-0004, Japan
Keywords
Emotional intensity estimation; Metric learning; Self-attention; Neural network; Recognition; Attention; Features; Information; Databases
DOI
10.5057/ijae.IJAE-D-22-00011
Chinese Library Classification (CLC)
T [Industrial Technology]
Subject Classification Code
08
Abstract
Recognizing facial expressions and estimating the intensities of their corresponding action units have reached many milestones. However, such estimation remains challenging because action units vary only subtly during emotional arousal. The latest approaches are constrained by the characteristics of the probabilistic models used to capture relationships among action units. Considering the ordinal relationships across an emotional transition sequence, we propose two metric learning approaches, built on self-attention-based triplet and Siamese networks, to estimate emotional intensities. Our emotion expert branches use the shifted-window (Swin) Transformer, which restricts self-attention computation to non-overlapping local windows while also allowing cross-window connections. This offers flexible, high-performance modeling of action units at various scales. We evaluated our networks' spatial and temporal feature localization on the CK+, KDEF-dyn, AFEW, SAMM, and CASME-II datasets. They outperform state-of-the-art deep learning methods in micro-expression detection on the latter two datasets by 2.4% and 2.6% UAR, respectively. Ablation studies with a thorough analysis highlight the strength of our design.
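To make the metric-learning idea in the abstract concrete, the sketch below pairs a Swin Transformer backbone with a triplet loss so that frames of similar expression intensity are pulled together in the embedding space and dissimilar ones are pushed apart. This is a minimal illustration, not the paper's implementation: the `timm` backbone name, the 128-d projection head, the margin value, the `SwinTripletEmbedder` and `triplet_step` names, and the way triplets are sampled are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): triplet metric learning over
# Swin-Transformer features for expression-intensity estimation.
# Assumes the `timm` library; backbone choice, embedding size, and margin
# are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import timm


class SwinTripletEmbedder(nn.Module):
    """Swin backbone + projection head producing L2-normalised embeddings."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # num_classes=0 makes timm return pooled backbone features
        # (768-d for swin_tiny) instead of classification logits.
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0
        )
        self.head = nn.Linear(self.backbone.num_features, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.head(self.backbone(x))
        return nn.functional.normalize(z, dim=-1)


def triplet_step(model, anchor, positive, negative, margin: float = 0.2):
    """Pull the anchor toward the positive (similar intensity) and push it
    away from the negative (different intensity) by at least `margin`."""
    loss_fn = nn.TripletMarginLoss(margin=margin)
    return loss_fn(model(anchor), model(positive), model(negative))


if __name__ == "__main__":
    model = SwinTripletEmbedder()
    frames = lambda: torch.randn(4, 3, 224, 224)  # dummy face crops
    print(triplet_step(model, frames(), frames(), frames()))
```

A Siamese variant of the same idea would keep the shared backbone but score pairs of frames with a contrastive or pairwise ranking loss instead of the triplet loss, which is closer to the second approach the abstract mentions.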
Pages: 15