Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering

Cited by: 7

Authors:

Chen, Zailong [1]

Wang, Lei [1]

Wang, Peng [2]

Gao, Peng [3]

Affiliations:

[1] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2522, Australia

[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610056, Peoples R China

[3] Beijing Normal Univ Hong Kong Baptist Univ United, Inst Comp Sci, Zhuhai 519000, Peoples R China

Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY | 2024, Vol. 34, No. 5

Keywords:

Feature extraction; Visualization; Task analysis; Question answering (information retrieval); Data mining; Fuses; Focusing; Audio-visual question answering; video understanding; multimodal learning; deep learning; DIALOG

DOI:

10.1109/TCSVT.2023.3318220

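Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronic technology, communication technology]

Discipline classification codes:

0808; 0809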

Abstract:

As a newly emerging task, audio-visual question answering (AVQA) has attracted research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information. Second, in the presence of more information, feature extraction has to be better connected with the given question. Third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterises an audio-visual scene at both global and local levels, and within each level, the features from different modalities are well fused. Furthermore, the given question is utilised to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding by focusing on both general and specific representations and by aggregating the modalities while prioritizing question-related information. As experimentally demonstrated, our method significantly improves on existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, the ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.

Pages: 4109-4119

Number of pages: 11

