CAD - Contextual Multi-modal Alignment for Dynamic AVQA

Cited by: 0
Authors
Nadeem, Asmar [1 ]
Hilton, Adrian [1 ]
Dawes, Robert [1 ]
Thomas, Graham [2 ]
Mustafa, Armin [1 ]
Affiliations
[1] Univ Surrey, Ctr Vis Speech & Signal Proc CVSSP, Guildford, Surrey, England
[2] BBC Res & Dev, London, England
Source
2024 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, WACV 2024 | 2024
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
DIALOG;
DOI
10.1109/WACV57701.2024.00709
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual modalities can be learned at three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing AVQA methods suffer from two major shortcomings: the audio-visual (AV) information passing through the network is not aligned at the Spatial and Temporal levels, and inter-modal (audio and visual) Semantic information is often not balanced within a context; both result in poor performance. In this paper, we propose a novel end-to-end Contextual Multi-modal Alignment (CAD) network that addresses the challenges in AVQA methods by i) introducing a parameter-free stochastic Contextual block that ensures robust audio and visual alignment at the Spatial level; ii) proposing a pre-training technique for dynamic audio and visual alignment at the Temporal level in a self-supervised setting; and iii) introducing a cross-attention mechanism to balance audio and visual information at the Semantic level. The proposed CAD network improves overall performance over state-of-the-art methods by an average of 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our proposed contributions to AVQA can be added to existing methods to improve their performance without additional complexity requirements.
Pages: 7236-7248 (13 pages)
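The record gives no implementation details for the paper's contribution (iii), a cross-attention mechanism that balances audio and visual information at the Semantic level. As a minimal illustration of the general cross-attention idea only (not the authors' actual architecture), the sketch below lets each modality's features attend over the other's and fuses the two results symmetrically; all names, shapes, and the fusion scheme are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries come from one modality,
    # keys/values from the other, so each query is re-expressed as a
    # weighted mix of the other modality's features.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Tq, Tk) similarity
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values                  # (Tq, d)

rng = np.random.default_rng(0)
T, d = 4, 8                                  # toy sizes: 4 steps, 8-dim features
audio = rng.standard_normal((T, d))
visual = rng.standard_normal((T, d))

# Audio contextualized by visual features, and vice versa
audio_ctx = cross_attention(audio, visual, visual)
visual_ctx = cross_attention(visual, audio, audio)

# Symmetric average so neither modality dominates the fused representation
fused = 0.5 * (audio_ctx + visual_ctx)
print(fused.shape)  # (4, 8)
```

The symmetric averaging here is one simple way to keep the two modalities' contributions balanced; the paper's actual balancing strategy may differ.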