A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Cited by: 44
Authors
Praveen, R. Gnana [1]
de Melo, Wheidima Carneiro [1]
Ullah, Nasib [1]
Aslam, Haseeb [1]
Zeeshan, Osama [1]
Denorme, Theo [1]
Pedersoli, Marco [1]
Koerich, Alessandro L. [1]
Bacon, Simon [2]
Cardinal, Patrick [1]
Granger, Eric [1]
Affiliations
[1] Ecole Technol Super, LIVIA, Montreal, PQ, Canada
[2] Concordia Univ, Dept Hlth Kinesiol & Appl Physiol, Montreal, PQ, Canada
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2022) | 2022
DOI: 10.1109/CVPRW56347.2022.00278
CLC Number: TP301 [Theory, Methods]
Discipline Code: 081202
Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing accurate prediction of valence and arousal. In particular, this model computes cross-attention weights based on the correlation between a joint feature representation and the individual modalities. By deploying a joint A-V feature representation in the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieved a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This represents a significant improvement over the baseline of the third Affective Behavior Analysis in-the-Wild (ABAW3) competition, which reported a CCC of 0.180 (0.310) and 0.170 (0.170).
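The abstract describes computing cross-attention weights from the correlation between a joint audio-visual representation and each individual modality. A minimal NumPy sketch of that idea follows; the weight initialization, `tanh` scoring, and residual connection are illustrative assumptions rather than the authors' exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv, rng=np.random.default_rng(0)):
    """Simplified joint cross-attention fusion sketch.

    Xa, Xv: (L, d) audio / visual feature sequences for one clip.
    Returns attended audio and visual features of the same shape.
    """
    L, d = Xa.shape
    # Joint representation: concatenation of both modalities per time step.
    J = np.concatenate([Xa, Xv], axis=1)           # (L, 2d)
    # Learnable projections in the real model; random here for illustration.
    Wa = rng.standard_normal((d, 2 * d)) * 0.02
    Wv = rng.standard_normal((d, 2 * d)) * 0.02
    # Correlation of each modality with the joint representation.
    Ca = np.tanh(Xa @ Wa @ J.T / np.sqrt(2 * d))   # (L, L)
    Cv = np.tanh(Xv @ Wv @ J.T / np.sqrt(2 * d))
    # Cross-attention weights derived from those correlations.
    Aa = softmax(Ca, axis=-1)
    Av = softmax(Cv, axis=-1)
    # Attend and add a residual so the original features are preserved.
    Xa_att = Aa @ Xa + Xa
    Xv_att = Av @ Xv + Xv
    return Xa_att, Xv_att
```

The attended features would then be concatenated and passed to a regression head that predicts valence and arousal per frame.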
Pages: 2485-2494
Page count: 10