Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Cited by: 15
Authors
Zhang, Xiaoqin [1 ]
Li, Min [1 ]
Lin, Sheng [1 ]
Xu, Hang [1 ]
Xiao, Guobao [1 ]
Affiliations
[1] Wenzhou Univ, Key Lab of Intelligent Informatics for Safety & Emergency of Zhejiang Province, Wenzhou 325035, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Dynamic facial expression recognition; multimodal information fusion; semantic alignment; deep learning; NETWORK; AWARE
DOI
10.1109/TCSVT.2023.3312858
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Dynamic facial expression recognition in the wild is a challenging task due to obstacles such as low-light conditions, non-frontal faces, and facial occlusion. Purely vision-based approaches may not suffice to accurately capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework capable of effectively extracting multimodal information and achieving significant cross-modal augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively, each carefully tailored to its modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representations among these modalities. Its combination of self-attention and cross-attention makes the integrated output features more robust for encoding emotion. By mapping audio and textual features into the latent space of visual features, the module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, extensive experiments on three popular datasets (MAFW, DFEW, and AFEW) demonstrate state-of-the-art performance. This work offers a promising direction for future studies to improve emotion recognition accuracy by exploiting the power of multimodal features.
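The record describes T-MEP only at a high level, but the fusion step lends itself to a short illustration. Below is a minimal PyTorch sketch of the described idea: visual tokens attend to audio and text sequences via cross-attention (mapping both into the visual latent space), and self-attention then integrates the augmented representation. All class names, dimensions, and the exact attention arrangement are assumptions; the paper's actual architecture is not reproduced here.

```python
# Hypothetical sketch of cross-modal fusion, NOT the authors' T-MEP code.
# Assumes all three encoders emit token sequences of a shared dimension.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vis, aud, txt):
        # Visual tokens query the audio and text sequences, pulling both
        # modalities into the visual latent space (semantic alignment).
        a, _ = self.cross_audio(query=vis, key=aud, value=aud)
        t, _ = self.cross_text(query=vis, key=txt, value=txt)
        fused = self.norm1(vis + a + t)
        # Self-attention integrates the augmented representation.
        s, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + s)

# Dummy modality-specific token sequences: (batch, tokens, dim).
vis = torch.randn(2, 16, 512)   # visual encoder output
aud = torch.randn(2, 32, 512)   # audio encoder output
txt = torch.randn(2, 24, 512)   # text encoder output
print(CrossModalFusion()(vis, aud, txt).shape)  # torch.Size([2, 16, 512])
```

Using the visual tokens as queries keeps the fused output in the visual feature space, which matches the abstract's description of mapping audio and textual information onto visual features.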
Pages: 3192-3203
Number of pages: 12