AVQA: A Dataset for Audio-Visual Question Answering on Videos

Cited by: 20
Authors
Yang, Pinci [1 ]
Wang, Xin [2 ]
Duan, Xuguang [2 ]
Chen, Hong [2 ]
Hou, Runze [1 ]
Jin, Cong [3 ]
Zhu, Wenwu [1 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Berkeley Shenzhen Inst, Beijing, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
[3] Commun Univ China, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
dataset; audio-visual question answering; multimodal;
DOI
10.1145/3503161.3548291
Chinese Library Classification (CLC) code
TP39 [Computer Applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
Audio-visual question answering aims to answer questions regarding both the audio and visual modalities in a given video, and has drawn increasing research interest in recent years. However, no appropriate dataset has so far existed for this challenging task on videos of real-life scenarios. Existing datasets are either designed with questions containing only visual clues, without taking any audio information into account, or consider audio only in restricted scenarios, such as panoramic videos or videos of music performances. In this paper, to overcome the limitations of existing datasets, we introduce AVQA, a new audio-visual question answering dataset on videos of real-life scenarios. We collect 57,015 videos of daily audio-visual activities and 57,335 specially designed question-answer pairs that rely on clues from both modalities, where the information contained in a single modality is insufficient or ambiguous. Furthermore, we propose a Hierarchical Audio-Visual Fusing module to model multiple semantic correlations among the audio, visual, and text modalities, and conduct ablation studies to analyze the role of different modalities on our dataset. Experimental results show that our proposed method significantly improves audio-visual question answering performance across various question types. Therefore, AVQA can provide an adequate testbed for developing models with a deeper understanding of multimodal information for audio-visual question answering in real-life scenarios. (The dataset is available at https://mn.cs.tsinghua.edu.cn/avqa)
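For illustration, the minimal PyTorch sketch below shows one generic way a hierarchical, question-guided audio-visual fusion module of the kind described in the abstract could be structured. It is an assumption-laden sketch, not the authors' released implementation of the Hierarchical Audio-Visual Fusing module: the feature dimensions, the use of multi-head cross-attention, the mean-pooled question query, and the answer-vocabulary size are all invented for the example.

# Illustrative sketch only -- NOT the authors' implementation of the
# Hierarchical Audio-Visual Fusing module. All dimensions and design
# details here are assumptions made for the example.
import torch
import torch.nn as nn


class HierarchicalAVFusionSketch(nn.Module):
    """Generic two-stage, question-guided audio-visual fusion."""

    def __init__(self, dim: int = 512, heads: int = 8, num_answers: int = 1000):
        super().__init__()
        # Stage 1: cross-modal attention between the audio and visual streams.
        self.audio_attends_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attends_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: the question queries the fused audio-visual sequence.
        self.question_attends_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)  # hypothetical answer vocabulary

    def forward(self, audio, visual, question):
        # audio:    (B, Ta, dim) clip-level audio features
        # visual:   (B, Tv, dim) clip-level visual features
        # question: (B, Tq, dim) token-level question features
        a_ctx, _ = self.audio_attends_visual(audio, visual, visual)
        v_ctx, _ = self.visual_attends_audio(visual, audio, audio)
        av = torch.cat([audio + a_ctx, visual + v_ctx], dim=1)  # (B, Ta+Tv, dim)
        q = question.mean(dim=1, keepdim=True)                  # (B, 1, dim) query
        fused, _ = self.question_attends_av(q, av, av)          # (B, 1, dim)
        return self.classifier(fused.squeeze(1))                # (B, num_answers) logits


if __name__ == "__main__":
    B, D = 2, 512
    model = HierarchicalAVFusionSketch(dim=D)
    logits = model(torch.randn(B, 10, D), torch.randn(B, 16, D), torch.randn(B, 12, D))
    print(logits.shape)  # torch.Size([2, 1000])

The two attention stages in this sketch correspond loosely to the "multiple semantic correlations" the abstract mentions: audio-visual interaction first, question-conditioned aggregation second.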
Pages: 3480-3491
Number of pages: 12