MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

Times Cited: 1
Authors
Wang, Linhuang [1 ]
Kang, Xin [1 ]
Ding, Fei [1 ]
Nakagawa, Satoshi [2 ]
Ren, Fuji [3 ]
Affiliations
[1] Univ Tokushima, Adv Technol & Sci, Tokushima, Japan
[2] Univ Tokyo, Grad Sch Informat Sci Technol, Tokyo, Japan
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
Dynamic facial expression recognition; Affective computing; Transformer; Spatio-temporal dependencies
DOI
10.1109/ICASSP48485.2024.10446699
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-Temporal CNN-Transformer Network (MSSTNet). Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating the multi-scale spatial information, yielding multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations further validates our approach's ability to exploit spatio-temporal information in DFER.
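As a rough illustration of the pipeline the abstract describes (CNN features at several scales, a multi-scale embedding layer, a temporal Transformer, then classification), below is a minimal PyTorch sketch. The module names, dimensions, backbone, and the scale-fusion step (summing the per-scale tokens) are assumptions made for illustration, not the authors' released implementation.

# Minimal sketch of the MSSTNet-style pipeline described in the abstract.
# Assumed details: a toy two-stage CNN backbone, global-average-pooled tokens,
# sum-fusion of scales, and a small Transformer encoder over frames.
import torch
import torch.nn as nn

class MELayer(nn.Module):
    """Project per-frame CNN feature maps of one scale into d_model tokens."""
    def __init__(self, in_channels: int, d_model: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # collapse the spatial grid
        self.proj = nn.Linear(in_channels, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        x = self.pool(x.view(b * t, c, h, w)).view(b, t, c)
        return self.proj(x)                        # (batch, frames, d_model)

class MSSTNetSketch(nn.Module):
    """CNN backbone -> multi-scale embedding -> temporal Transformer -> classifier."""
    def __init__(self, num_classes: int = 7, d_model: int = 256):
        super().__init__()
        # Toy backbone; the paper would use a deeper CNN (e.g. a ResNet).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.embed1 = MELayer(64, d_model)
        self.embed2 = MELayer(128, d_model)
        encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.t_former = nn.TransformerEncoder(encoder, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, height, width)
        b, t, c, h, w = clip.shape
        f1 = self.stage1(clip.view(b * t, c, h, w))            # scale-1 features
        f2 = self.stage2(f1)                                    # scale-2 features
        tokens = self.embed1(f1.view(b, t, *f1.shape[1:])) + \
                 self.embed2(f2.view(b, t, *f2.shape[1:]))      # fuse scales (assumed: sum)
        temporal = self.t_former(tokens)                        # temporal dependencies across frames
        return self.head(temporal.mean(dim=1))                  # average over frames, classify

if __name__ == "__main__":
    logits = MSSTNetSketch()(torch.randn(2, 8, 3, 112, 112))
    print(logits.shape)  # torch.Size([2, 7])

The sketch only conveys the data flow: per-frame spatial features at multiple scales become a single token sequence over time, and the temporal Transformer models dependencies across frames before classification.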
Pages: 3015-3019
Number of pages: 5