Label-Guided Dynamic Spatial-Temporal Fusion for Video-Based Facial Expression Recognition

Cited by: 0

Authors:

Zhang, Ziyang [1]

Tian, Xiang [1]

Zhang, Yuan [1]

Guo, Kailing [1,2]

Xu, Xiangmin [1,2,3]

Affiliations:

[1] South China Univ Technol, Guangzhou 510641, Peoples R China

[2] Pazhou Lab, Guangzhou 510330, Peoples R China

[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Keywords:

Feature extraction; Transformers; Convolutional neural networks; Three-dimensional displays; Face recognition; Data mining; Entropy; Facial expression recognition; spatial-temporal fusion; dynamic weights; frame label; FEATURES;

DOI:

10.1109/TMM.2024.3407693

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Video-based facial expression recognition (FER) in the wild is a common yet challenging task. Extracting spatial and temporal features simultaneously is a common approach but may not always yield optimal results due to the distinct nature of spatial and temporal information. Extracting spatial and temporal features in a cascaded manner has been proposed as an alternative approach. However, the results of video-based FER sometimes fall short compared to image-based FER, indicating underutilization of the spatial information of each frame and suboptimal modeling of frame relations in spatial-temporal fusion strategies. Although frame labels are highly related to the video label, they are overlooked in previous video-based FER methods. This paper proposes label-guided dynamic spatial-temporal fusion (LG-DSTF), which adopts frame labels to enhance the discriminative ability of spatial features and to guide temporal fusion. By assigning the video label to each frame, two auxiliary classification loss functions are constructed to steer discriminative spatial feature learning at different levels. The cross entropy between a uniform distribution and the label distribution of spatial features is used to measure the classification confidence of each frame. The confidence values serve as dynamic weights to emphasize crucial frames during temporal fusion of spatial features. LG-DSTF achieves state-of-the-art results on FER benchmarks.
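
The confidence-weighted fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how confidences are normalized into weights or how features are combined, so the sum-normalization and the weighted average below are assumptions, and the function names `frame_confidence` and `fuse_frames` are hypothetical. The only part taken from the abstract is the confidence measure itself, the cross entropy H(u, p) between a uniform distribution u and a frame's predicted label distribution p, which is smallest (log K for K classes) when p is uniform and grows as p becomes more peaked.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_confidence(logits, eps=1e-12):
    """Cross entropy H(u, p) between a uniform distribution u and the
    frame's predicted label distribution p = softmax(logits).
    Larger values indicate a more peaked (more confident) prediction."""
    probs = softmax(logits)
    k = len(probs)
    return -sum(math.log(p + eps) for p in probs) / k


def fuse_frames(features, logits_per_frame):
    """Weighted average of per-frame feature vectors.
    Assumption: weights are the confidences divided by their sum;
    the paper may normalize differently. Requires at least 2 classes,
    so that every confidence is strictly positive."""
    conf = [frame_confidence(l) for l in logits_per_frame]
    total = sum(conf)
    weights = [c / total for c in conf]
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


# Toy example: frame 0 has an uninformative prediction, frame 1 a peaked one.
features = [[1.0, 0.0], [0.0, 1.0]]
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
fused, weights = fuse_frames(features, logits)
print(weights, fused)
```

In the toy example the second frame receives the larger weight, so its feature vector dominates the fused representation.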

Citation

Pages: 10503-10513

Page count: 11

61 references in total

