A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Cited by: 44
Authors
Praveen, R. Gnana [1]
de Melo, Wheidima Carneiro [1]
Ullah, Nasib [1]
Aslam, Haseeb [1]
Zeeshan, Osama [1]
Denorme, Theo [1]
Pedersoli, Marco [1]
Koerich, Alessandro L. [1]
Bacon, Simon [2]
Cardinal, Patrick [1]
Granger, Eric [1]
Affiliations
[1] Ecole Technol Super, LIVIA, Montreal, PQ, Canada
[2] Concordia Univ, Dept Hlth Kinesiol & Appl Physiol, Montreal, PQ, Canada
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2022) | 2022
DOI: 10.1109/CVPRW56347.2022.00278
CLC Number: TP301 [Theory, Methods]
Discipline Code: 081202
Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary modalities, such as audio, visual, and biosignals. However, most state-of-the-art audio-visual (A-V) fusion methods rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. This paper focuses on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. We propose a joint cross-attention fusion model that can effectively exploit the complementary inter-modal relationships, allowing accurate prediction of valence and arousal. In particular, this model computes cross-attention weights based on the correlation between a joint feature representation and the individual modalities. By deploying a joint A-V feature representation in the cross-attention module, the performance of our fusion model improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieved a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This represents a significant improvement over the baseline of the third Affective Behavior Analysis in-the-Wild (ABAW3) competition, which reported a CCC of 0.180 (0.310) and 0.170 (0.170).
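The abstract describes computing cross-attention weights from the correlation between a joint audio-visual representation and each individual modality. A minimal NumPy sketch of that idea follows; the weight initialization, `tanh` scoring, and residual connection are illustrative assumptions rather than the authors' exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv, rng=np.random.default_rng(0)):
    """Simplified joint cross-attention fusion sketch.

    Xa, Xv: (L, d) audio / visual feature sequences for one clip.
    Returns attended audio and visual features of the same shape.
    """
    L, d = Xa.shape
    # Joint representation: concatenation of both modalities per time step.
    J = np.concatenate([Xa, Xv], axis=1)           # (L, 2d)
    # Learnable projections in the real model; random here for illustration.
    Wa = rng.standard_normal((d, 2 * d)) * 0.02
    Wv = rng.standard_normal((d, 2 * d)) * 0.02
    # Correlation of each modality with the joint representation.
    Ca = np.tanh(Xa @ Wa @ J.T / np.sqrt(2 * d))   # (L, L)
    Cv = np.tanh(Xv @ Wv @ J.T / np.sqrt(2 * d))
    # Cross-attention weights derived from those correlations.
    Aa = softmax(Ca, axis=-1)
    Av = softmax(Cv, axis=-1)
    # Attend and add a residual so the original features are preserved.
    Xa_att = Aa @ Xa + Xa
    Xv_att = Av @ Xv + Xv
    return Xa_att, Xv_att
```

The attended features would then be concatenated and passed to a regression head that predicts valence and arousal per frame.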
Pages: 2485-2494
Page count: 10