AVQA: A Dataset for Audio-Visual Question Answering on Videos

Cited by: 20
Authors
Yang, Pinci [1 ]
Wang, Xin [2 ]
Duan, Xuguang [2 ]
Chen, Hong [2 ]
Hou, Runze [1 ]
Jin, Cong [3 ]
Zhu, Wenwu [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Beijing, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
[3] Commun Univ China, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
dataset; audio-visual question answering; multimodal;
DOI
10.1145/3503161.3548291
Chinese Library Classification (CLC) code
TP39 [Computer Applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
Audio-visual question answering aims to answer questions regarding both the audio and visual modalities in a given video, and has drawn increasing research interest in recent years. However, no appropriate dataset has so far existed for this challenging task on videos of real-life scenarios. Existing datasets are either designed with questions containing only visual clues, without taking any audio information into account, or consider audio only in restricted scenarios, such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. We collect 57,015 videos of daily audio-visual activities and 57,335 specially designed question-answer pairs that rely on clues from both modalities, where the information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the audio, visual, and text modalities, and conduct ablation studies to analyze the role of different modalities on our dataset. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. Therefore, AVQA can provide an adequate testbed for developing models with a deeper understanding of multimodal information for audio-visual question answering in real-life scenarios. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)
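For illustration, the minimal PyTorch sketch below shows one generic way a hierarchical, question-guided audio-visual fusion module of the kind described in the abstract could be structured. It is an assumption-laden sketch, not the authors' released implementation of the Hierarchical Audio-Visual Fusing module: the feature dimensions, the use of multi-head cross-attention, the mean-pooled question query, and the answer-vocabulary size are all invented for the example.

# Illustrative sketch only -- NOT the authors' implementation of the
# Hierarchical Audio-Visual Fusing module. All dimensions and design
# details here are assumptions made for the example.
import torch
import torch.nn as nn


class HierarchicalAVFusionSketch(nn.Module):
    """Generic two-stage, question-guided audio-visual fusion."""

    def __init__(self, dim: int = 512, heads: int = 8, num_answers: int = 1000):
        super().__init__()
        # Stage 1: cross-modal attention between the audio and visual streams.
        self.audio_attends_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attends_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: the question queries the fused audio-visual sequence.
        self.question_attends_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)  # hypothetical answer vocabulary

    def forward(self, audio, visual, question):
        # audio:    (B, Ta, dim) clip-level audio features
        # visual:   (B, Tv, dim) clip-level visual features
        # question: (B, Tq, dim) token-level question features
        a_ctx, _ = self.audio_attends_visual(audio, visual, visual)
        v_ctx, _ = self.visual_attends_audio(visual, audio, audio)
        av = torch.cat([audio + a_ctx, visual + v_ctx], dim=1)  # (B, Ta+Tv, dim)
        q = question.mean(dim=1, keepdim=True)                  # (B, 1, dim) query
        fused, _ = self.question_attends_av(q, av, av)          # (B, 1, dim)
        return self.classifier(fused.squeeze(1))                # (B, num_answers) logits


if __name__ == "__main__":
    B, D = 2, 512
    model = HierarchicalAVFusionSketch(dim=D)
    logits = model(torch.randn(B, 10, D), torch.randn(B, 16, D), torch.randn(B, 12, D))
    print(logits.shape)  # torch.Size([2, 1000])

The two attention stages in this sketch correspond loosely to the "multiple semantic correlations" the abstract mentions: audio-visual interaction first, question-conditioned aggregation second.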
Pages: 3480-3491
Number of pages: 12