Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Cited by: 15
Authors
Zhang, Xiaoqin [1 ]
Li, Min [1 ]
Lin, Sheng [1 ]
Xu, Hang [1 ]
Xiao, Guobao [1 ]
Affiliations
[1] Wenzhou Univ, Key Lab Intelligent Informat Safety & Emergency Zh, Wenzhou 325035, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Dynamic facial expression recognition; multimodal information fusion; semantic alignment; deep learning; NETWORK; AWARE;
DOI
10.1109/TCSVT.2023.3312858
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Dynamic facial expression recognition in the wild is a challenging task owing to obstacles such as low-light conditions, non-frontal faces, and facial occlusion. Purely vision-based approaches may not suffice to accurately capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework that effectively extracts multimodal information and achieves significant cross-modal augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is carefully designed to maximize its adaptation to the corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representations among these modalities. The unique combination of self-attention and cross-attention in this module enhances the robustness of the integrated output features in encoding emotion. By mapping the information from audio and textual features into the latent space of visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method on three popular datasets (MAFW, DFEW, and AFEW) through extensive experiments, which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting the power of multimodal features.
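The fusion step described in the abstract, where cross-attention maps audio and textual features into the latent space of the visual features, can be sketched roughly as follows. This is a minimal illustrative sketch only, not the authors' implementation: all names, dimensions, and the residual-style combination are assumptions, and the real T-MEP module also involves learned projections and self-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    """Attend from `query` tokens (e.g. visual features) to `key_value`
    tokens of another modality, pulling that modality's information
    into the query's latent space. Projection matrices are omitted
    for brevity; both inputs share the feature dimension here."""
    d_k = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d_k)   # (n_q, n_kv)
    return softmax(scores, axis=-1) @ key_value   # (n_q, d)

# Illustrative token sequences for the three modalities (dim 64).
rng = np.random.default_rng(0)
visual = rng.standard_normal((16, 64))  # 16 frame tokens
audio = rng.standard_normal((20, 64))   # 20 audio tokens
text = rng.standard_normal((12, 64))    # 12 text tokens

# Visual tokens enriched with audio and text context, residual-style.
fused = visual + cross_attention(visual, audio) + cross_attention(visual, text)
print(fused.shape)  # (16, 64)
```

The fused sequence keeps the visual token count and dimensionality, so downstream classification layers operating on visual features need not change, which is one motivation for aligning the other modalities to the visual latent space.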
Pages: 3192-3203
Page count: 12