MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

Times Cited: 1
Authors
Wang, Linhuang [1 ]
Kang, Xin [1 ]
Ding, Fei [1 ]
Nakagawa, Satoshi [2 ]
Ren, Fuji [3 ]
Affiliations
[1] Univ Tokushima, Adv Technol & Sci, Tokushima, Japan
[2] Univ Tokyo, Grad Sch Informat Sci Technol, Tokyo, Japan
[3] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
Dynamic facial expression recognition; Affective computing; Transformer; Spatio-temporal dependencies
DOI
10.1109/ICASSP48485.2024.10446699
CLC Number
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-Temporal CNN-Transformer Network (MSSTNet). Our approach takes spatial features of different scales extracted by a CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating the multi-scale spatial information, yielding multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations further validates our approach's ability to exploit spatio-temporal information in DFER.
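As a rough illustration of the pipeline the abstract describes (CNN features at several scales, a multi-scale embedding layer, a temporal Transformer, then classification), below is a minimal PyTorch sketch. The module names, dimensions, backbone, and the scale-fusion step (summing the per-scale tokens) are assumptions made for illustration, not the authors' released implementation.

# Minimal sketch of the MSSTNet-style pipeline described in the abstract.
# Assumed details: a toy two-stage CNN backbone, global-average-pooled tokens,
# sum-fusion of scales, and a small Transformer encoder over frames.
import torch
import torch.nn as nn

class MELayer(nn.Module):
    """Project per-frame CNN feature maps of one scale into d_model tokens."""
    def __init__(self, in_channels: int, d_model: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # collapse the spatial grid
        self.proj = nn.Linear(in_channels, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        x = self.pool(x.view(b * t, c, h, w)).view(b, t, c)
        return self.proj(x)                        # (batch, frames, d_model)

class MSSTNetSketch(nn.Module):
    """CNN backbone -> multi-scale embedding -> temporal Transformer -> classifier."""
    def __init__(self, num_classes: int = 7, d_model: int = 256):
        super().__init__()
        # Toy backbone; the paper would use a deeper CNN (e.g. a ResNet).
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.embed1 = MELayer(64, d_model)
        self.embed2 = MELayer(128, d_model)
        encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.t_former = nn.TransformerEncoder(encoder, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, height, width)
        b, t, c, h, w = clip.shape
        f1 = self.stage1(clip.view(b * t, c, h, w))            # scale-1 features
        f2 = self.stage2(f1)                                    # scale-2 features
        tokens = self.embed1(f1.view(b, t, *f1.shape[1:])) + \
                 self.embed2(f2.view(b, t, *f2.shape[1:]))      # fuse scales (assumed: sum)
        temporal = self.t_former(tokens)                        # temporal dependencies across frames
        return self.head(temporal.mean(dim=1))                  # average over frames, classify

if __name__ == "__main__":
    logits = MSSTNetSketch()(torch.randn(2, 8, 3, 112, 112))
    print(logits.shape)  # torch.Size([2, 7])

The sketch only conveys the data flow: per-frame spatial features at multiple scales become a single token sequence over time, and the temporal Transformer models dependencies across frames before classification.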
Pages: 3015-3019
Number of pages: 5