Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

Times Cited: 16
Authors
Praveen, R. Gnana [1]
Cardinal, Patrick [1]
Granger, Eric [1]
Affiliations
[1] Ecole Technol Super, Dept Syst Engn, Lab Imagerie Vis & Intelligence Artificielle, Montreal, PQ H3C 1K3, Canada
Source
IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE | 2023, Vol. 5, No. 3
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
Dimensional emotion recognition; deep learning; multimodal fusion; joint representation; cross-attention; SPEECH; ROBUST;
DOI
10.1109/TBIOM.2022.3233083
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance over unimodal approaches by combining diverse and complementary sources of information, which also provides some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual's emotional states in the valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities and effectively leverages the inter-modal relationships while retaining the intra-modal ones. In particular, it computes the cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. Deploying the joint A-V feature representation in the cross-attention module leverages both the intra- and inter-modal relationships simultaneously, thereby significantly improving performance over the vanilla cross-attention module. The effectiveness of the proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at https://github.com/praveena2j/Joint-CrossAttention-for-Audio-Visual-Fusion.
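To make the fusion mechanism concrete, the following PyTorch sketch computes correlation-based cross-attention between each modality and the joint A-V representation, as the abstract describes. This is a minimal sketch, not the authors' implementation (which is available at the repository above): the module name, layer names (W_ja, proj_a, ...), feature dimensions, and the softmax normalization are illustrative assumptions; the paper's exact formulation is given in the article and the released code.

```python
import math
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Sketch of joint cross-attentional audio-visual fusion.

    Audio and visual feature sequences are concatenated into a joint
    representation; each modality's cross-attention weights are computed
    from the correlation between that modality's features and the joint
    representation, so the attended features carry both intra- and
    inter-modal information.
    """

    def __init__(self, d_a: int = 512, d_v: int = 512):
        super().__init__()
        d_j = d_a + d_v                      # joint feature dimension
        self.scale = math.sqrt(d_j)
        # learnable correlation weights between each modality and the joint rep.
        self.W_ja = nn.Linear(d_j, d_a, bias=False)
        self.W_jv = nn.Linear(d_j, d_v, bias=False)
        # projections from attended joint features back to each modality
        self.proj_a = nn.Linear(d_j, d_a, bias=False)
        self.proj_v = nn.Linear(d_j, d_v, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (B, L, d_a) audio features; x_v: (B, L, d_v) visual features,
        # one vector per clip segment from the unimodal backbones.
        j = torch.cat([x_a, x_v], dim=-1)    # joint representation, (B, L, d_a+d_v)
        # joint cross-correlation maps over the L segments, (B, L, L)
        c_a = torch.tanh(x_a @ self.W_ja(j).transpose(1, 2) / self.scale)
        c_v = torch.tanh(x_v @ self.W_jv(j).transpose(1, 2) / self.scale)
        # weight the joint features by each correlation map, project back to the
        # modality's dimension, and keep a residual so intra-modal cues survive
        att_a = x_a + self.proj_a(torch.softmax(c_a, dim=-1) @ j)
        att_v = x_v + self.proj_v(torch.softmax(c_v, dim=-1) @ j)
        # fused A-V features for the downstream valence-arousal regressor
        return torch.cat([att_a, att_v], dim=-1)


# Example: fuse 8-segment clips with 512-d features per modality.
fusion = JointCrossAttentionFusion(d_a=512, d_v=512)
fused = fusion(torch.randn(2, 8, 512), torch.randn(2, 8, 512))  # -> (2, 8, 1024)
```

Because the correlation maps are taken against the concatenated joint representation rather than against the other modality alone, each modality's attention weights reflect both its own features and the complementary modality's, which is the distinction the abstract draws against vanilla cross-attention.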
Pages: 360-373 (14 pages)