MSSTNET: A MULTI-SCALE SPATIO-TEMPORAL CNN-TRANSFORMER NETWORK FOR DYNAMIC FACIAL EXPRESSION RECOGNITION

Cited by: 1

Authors:

Wang, Linhuang [1]

Kang, Xin [1]

Ding, Fei [1]

Nakagawa, Satoshi [2]

Ren, Fuji [3]

Affiliations:

[1] Univ Tokushima, Adv Technol & Sci, Tokushima, Japan

[2] Univ Tokyo, Grad Sch Informat Sci Technol, Tokyo, Japan

[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024

Keywords:

Dynamic facial expression recognition; Affective Computing; Transformer; Spatio-temporal dependencies

DOI:

10.1109/ICASSP48485.2024.10446699

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating multi-scale spatial information. This process produces multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. A series of ablation experiments and visualizations further validates our approach's ability to leverage spatio-temporal information in DFER.
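
The data flow described in the abstract (CNN features at several scales, a per-scale embedding, then a temporal Transformer that folds the scales in one after another) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name `MSSTNetSketch`, the toy three-stage backbone, the pooling-plus-linear stand-in for the MELayer, the one-encoder-block-per-scale stand-in for the T-Former, and every hyperparameter (including `num_classes=7`) are assumptions chosen only to make the example small and runnable.

```python
import torch
import torch.nn as nn


class MSSTNetSketch(nn.Module):
    """Hypothetical layout: CNN stages -> per-scale embeddings -> temporal Transformer."""

    def __init__(self, num_classes=7, dim=64, channels=(16, 32, 64)):
        super().__init__()
        # Toy CNN backbone (assumption): each stage halves the resolution,
        # so each stage output is one spatial scale.
        stages, in_ch = [], 3
        for out_ch in channels:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        # Stand-in for the MELayer (assumption): pool each scale spatially
        # and project it to a shared token width.
        self.embed = nn.ModuleList([nn.Linear(c, dim) for c in channels])
        # Stand-in for the T-Former (assumption): one encoder block per scale,
        # each attending over the frame axis.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=4, dim_feedforward=2 * dim, batch_first=True
            )
            for _ in channels
        ])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip):
        # clip: (batch, frames, 3, height, width)
        b, t = clip.shape[:2]
        x = clip.flatten(0, 1)  # treat every frame as an independent image
        scale_tokens = []
        for stage, embed in zip(self.stages, self.embed):
            x = stage(x)
            pooled = x.mean(dim=(2, 3))  # (batch * frames, channels)
            scale_tokens.append(embed(pooled).view(b, t, -1))
        # Temporal attention, adding one more spatial scale before each block.
        tokens = torch.zeros_like(scale_tokens[0])
        for block, scale in zip(self.blocks, scale_tokens):
            tokens = block(tokens + scale)
        # Average over frames, then classify the clip.
        return self.head(tokens.mean(dim=1))
```

For a batch of two 8-frame clips at 32x32 resolution, `MSSTNetSketch()(torch.randn(2, 8, 3, 32, 32))` returns one row of class logits per clip. Adding each scale's tokens just before its own encoder block is only one simple way to mimic "continually integrating" spatial information; the paper may fuse scales differently.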


Pages: 3015-3019

Page count: 5
