GViG: Generative Visual Grounding Using Prompt-Based Language Modeling for Visual Question Answering

Authors
Li, Yi-Ting [1]
Lin, Ying-Jia [1]
Yeh, Chia-Jen [1]
Lin, Chun-Yi [1]
Kao, Hung-Yu [1]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan, Taiwan
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT VI, PAKDD 2024 | 2024 / Vol. 14650
Keywords
Visual Question Answering; Visual Grounding; Prompt Tuning;
DOI
10.1007/978-981-97-2266-2_7
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The WSDM 2023 Toloka VQA challenge introduces a new Grounding-based Visual Question Answering (GVQA) dataset, elevating multimodal task complexity. This challenge diverges from traditional VQA by requiring models to identify a bounding box in response to an image-question pair, aligning it with Visual Grounding (VG) tasks. Existing VG approaches, when applied to GVQA, often necessitate external data or larger models for satisfactory results, leading to high computational demands. We approach this as a language modeling problem, utilizing prompt tuning with multiple state-of-the-art VQA models. Our method, operating solely on an NVIDIA RTX3090 GPU without external data, secured third place in the challenge, achieving an Intersection over Union (IoU) of 75.658. Our model notably provides explainability between textual and visual data through its attention mechanism, offering insights into its decision-making process. This research demonstrates that high performance in GVQA can be achieved with minimal resources, enhancing understanding of model dynamics and paving the way for improved interpretability and efficiency. Our code is available here: https://github.com/IKMLab/GViG.git
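The abstract reports its result as Intersection over Union (IoU) between a predicted and a reference bounding box. As a minimal sketch of how that metric is commonly computed, the Python function below assumes axis-aligned boxes given as (left, top, right, bottom). The function name `iou` and the box format are illustrative assumptions, not taken from the paper or the challenge's evaluation code. The reported 75.658 is presumably an average on a 0-100 scale, whereas this sketch returns a value in [0, 1] for a single pair of boxes.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (left, top, right, bottom); this format is an assumption
    made for illustration, not the challenge's documented format.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Individual areas, clamped so malformed boxes do not yield negative area.
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    union = area_a + area_b - inter
    # Two empty boxes have no defined overlap; report 0.0 in that case.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    reference = (1, 1, 3, 3)
    print(iou(predicted, reference))
```

For the two boxes in the usage example, the overlap is a 1x1 square and the union covers 7 square units, so the score is 1/7.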
Pages: 83-94
Page count: 12