GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering

Cited by: 0

Authors:

Li, Yi-Ting [1]

Lin, Ying-Jia [1]

Yeh, Chia-Jen [1]

Lin, Chun-Yi [1]

Kao, Hung-Yu [1]

Affiliation:

[1] National Cheng Kung University, Department of Computer Science and Information Engineering, Tainan, Taiwan

Source:

ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT VI, PAKDD 2024 | 2024 / Vol. 14650

Keywords:

Visual Question Answering; Visual Grounding; Prompt Tuning;

DOI:

10.1007/978-981-97-2266-2_7

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning with Visual Grounding tasks. Existing VG approaches, when applied to GVQA, often necessitate external data or larger models for satisfactory results, leading to high computational demands. We approach this as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available here: https://github.com/IKMLab/GViG.git
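The abstract reports its result as an Intersection over Union (IoU) score between predicted and reference bounding boxes. As a minimal sketch of how that metric is commonly computed for one pair of axis-aligned boxes: the `(left, top, right, bottom)` box layout and the function name `iou` are assumptions made for illustration and are not taken from the paper or the challenge code. The reported 75.658 is presumably an average on a 0 to 100 scale, whereas this sketch returns a value in [0, 1] for a single pair.

```python
from typing import Tuple

# Assumed box layout (not from the paper): (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region; clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Individual areas; degenerate (inverted) boxes count as zero area.
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A dataset-level score under this assumption would be the mean of `iou` over all image-question pairs, multiplied by 100 if reported as a percentage.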

Pages: 83-94

Page count: 12
