CAD - Contextual Multi-modal Alignment for Dynamic AVQA

Cited by: 0
Authors
Nadeem, Asmar [1 ]
Hilton, Adrian [1 ]
Dawes, Robert [1 ]
Thomas, Graham [2 ]
Mustafa, Armin [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
[2] BBC Res & Dev, London, England
Source
2024 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, WACV 2024 | 2024
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
DIALOG;
DOI
10.1109/WACV57701.2024.00709
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learned at three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned at the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context; both result in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment at the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment at the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information at the Semantic level. The proposed CAD network improves overall performance over state-of-the-art methods by an average of 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to existing methods to improve their performance without additional complexity requirements.
Pages: 7236-7248 (13 pages)
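The record gives no implementation details for the paper's contribution (iii), a cross-attention mechanism that balances audio and visual information at the Semantic level. As a minimal illustration of the general cross-attention idea only (not the authors' actual architecture), the sketch below lets each modality's features attend over the other's and fuses the two results symmetrically; all names, shapes, and the fusion scheme are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries come from one modality,
    # keys/values from the other, so each query is re-expressed as a
    # weighted mix of the other modality's features.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) similarity
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values                  # (Tq, d)

rng = np.random.default_rng(0)
T, d = 4, 8                                  # toy sizes: 4 steps, 8-dim features
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))

# Audio contextualized by visual features, and vice versa
audio_ctx = cross_attention(audio, visual, visual)
visual_ctx = cross_attention(visual, audio, audio)

# Symmetric average so neither modality dominates the fused representation
fused = 0.5 * (audio_ctx + visual_ctx)
print(fused.shape)  # (4, 8)
```

The symmetric averaging here is one simple way to keep the two modalities' contributions balanced; the paper's actual balancing strategy may differ.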