FER-Former: Multimodal Transformer for Facial Expression Recognition

Cited by: 2

Authors:

Li, Yande [1,2,3]

Wang, Mingjie [4]

Gong, Minglun [2]

Lu, Yonggang [1]

Liu, Li [5]

Affiliations:

[1] Lanzhou Univ, Sch Informat Sci & Engn, Lanzhou 730000, Peoples R China

[2] Univ Guelph, Sch Comp Sci, Guelph, ON N1G 2W1, Canada

[3] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6G 1H9, Canada

[4] Zhejiang Sci Tech Univ, Sch Sci, Hangzhou 310018, Peoples R China

[5] Chongqing Univ, Sch Big Data & Software Engn, Chongqing 401331, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2025 / Vol. 27

Keywords:

Transformers; Annotations; Semantics; Feature extraction; Head; Computational modeling; Face recognition; Electronic mail; Collaboration; Correlation; Annotation ambiguity; CLIP; facial expression recognition; multimodal; vision transformer; NETWORK;

DOI:

10.1109/TMM.2024.3521788

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

The ever-increasing demand for intuitive interaction in virtual reality has led to surging interest in facial expression recognition (FER). However, existing methods commonly suffer from several issues, including narrow receptive fields and homogeneous supervisory signals. To address these issues, we propose a novel multimodal supervision-steering transformer for facial expression recognition in the wild, referred to as FER-former. Specifically, to address the limitation of narrow receptive fields, a hybrid feature extraction pipeline is designed by cascading prevailing CNNs and transformers. To deal with homogeneous supervisory signals, a heterogeneous domain-steering supervision module is proposed that incorporates text-space semantic correlations to enhance image features, based on the similarity between image and text features. Additionally, a FER-specific transformer encoder is introduced to characterize conventional one-hot label-focusing tokens and CLIP-based text-oriented tokens in parallel for final classification. Through the collaboration of these token heads, global receptive fields with multimodal semantic cues are captured, strengthening the model's learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
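The abstract says the supervision module relies on the similarity between image and text features. As a rough illustration only, the sketch below turns cosine similarities between one image feature and a set of per-class text features into a soft target distribution. It is not the authors' implementation: the function names, the `temperature` parameter, and the toy vectors are all assumptions made for this example, and real features would come from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def text_steered_targets(image_feature, text_features, temperature=0.1):
    """Hypothetical helper: soft class targets from image-text similarity.

    image_feature: one image embedding (list of floats).
    text_features: one text embedding per expression class.
    temperature:   assumed sharpening factor, not taken from the paper.
    """
    sims = [cosine(image_feature, t) for t in text_features]
    return softmax(sims, temperature)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs.
    image = [0.9, 0.1]
    texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(text_steered_targets(image, texts))
```

A distribution like this could serve as a supervisory signal alongside the one-hot label, which is one plausible reading of the heterogeneous supervision the abstract describes.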


Pages: 2412-2422

Page count: 11

References: 65 in total

