MSSTNET: A MULTI-SCALE SPATIO-TEMPORAL CNN-TRANSFORMER NETWORK FOR DYNAMIC FACIAL EXPRESSION RECOGNITION

Times cited: 1
Authors
Wang, Linhuang [1 ]
Kang, Xin [1 ]
Ding, Fei [1 ]
Nakagawa, Satoshi [2 ]
Ren, Fuji [3 ]
Affiliations
[1] Univ Tokushima, Adv Technol & Sci, Tokushima, Japan
[2] Univ Tokyo, Grad Sch Informat Sci Technol, Tokyo, Japan
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
Dynamic facial expression recognition; Affective Computing; Transformer; Spatio-temporal dependencies;
DOI
10.1109/ICASSP48485.2024.10446699
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-Temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating multi-scale spatial information. This process culminates in multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations further validates our approach's proficiency in leveraging spatio-temporal information within DFER.
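The data flow described in the abstract (per-frame features at several spatial scales, one embedding per frame, then attention across time) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the average-pooling stand-in for CNN features, the concatenation used as an "embedding", the scales (1, 2, 4), and the single-head attention with identity projections are all assumptions chosen to keep the example self-contained in plain Python.

```python
import math

def avg_pool(frame, size):
    """Average-pool a square 2D grid (list of rows) into size x size cells, flattened."""
    step = len(frame) // size
    out = []
    for i in range(size):
        for j in range(size):
            cell = [frame[r][c]
                    for r in range(i * step, (i + 1) * step)
                    for c in range(j * step, (j + 1) * step)]
            out.append(sum(cell) / len(cell))
    return out

def multi_scale_token(frame, scales=(1, 2, 4)):
    """Hypothetical stand-in for a multi-scale embedding: concatenate pooled
    features from several spatial scales into one token per frame."""
    token = []
    for s in scales:
        token.extend(avg_pool(frame, s))
    return token

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(tokens):
    """Single-head scaled dot-product attention across frames.
    Assumption: identity query/key/value projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def clip_feature(frames, scales=(1, 2, 4)):
    """Frames -> multi-scale tokens -> temporal attention -> mean over time."""
    tokens = [multi_scale_token(f, scales) for f in frames]
    attended = temporal_self_attention(tokens)
    d = len(attended[0])
    return [sum(t[i] for t in attended) / len(attended) for i in range(d)]

# Toy clip: 3 frames, each a 4x4 grid standing in for a CNN feature map.
frames = [[[float(t + r + c) for c in range(4)] for r in range(4)] for t in range(3)]
feature = clip_feature(frames)
print(len(feature))  # 21 = 1 + 4 + 16 values from the three scales
```

In a real model the pooled grids would be replaced by learned CNN feature maps and the resulting clip-level vector would feed a classifier over expression categories; the sketch only shows how spatial scale and temporal attention compose.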
Pages: 3015-3019
Number of pages: 5