Meta-prompt tuning for low-resource visual question answering

Times Cited: 0
Authors
Shao, Mingwen [1 ,2 ]
Liu, Yuanyuan [1 ,2 ]
Meng, Lingzhuang [2 ]
Shao, Xun [2 ]
Affiliations
[1] Quanzhou Vocat & Tech Univ, Joint Innovat Ind Coll, Jinjiang 362000, Fujian, Peoples R China
[2] China Univ Petr East China, Coll Comp Sci & Technol, Changjiang Rd, Qingdao 266580, Shandong, Peoples R China
Keywords
Low-resource visual question answering; Instruction tuning; Prompt tuning; Meta learning; Dynamic routing;
DOI
10.1007/s00530-025-01854-x
CLC Number
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Recently, fine-tuning pre-trained Vision-Language Models (VLMs) has achieved significant success on Low-resource Visual Question Answering (LVQA) tasks. However, existing works fail to match questions with their corresponding image features and lack a detailed analysis of fine-grained features, which reduces question-answering accuracy. To mitigate these issues, we propose a Meta-Prompt Tuning (MPT) approach that enables the model to understand and analyze diverse questions by attending to the relevant image regions, thereby producing accurate answers from a limited amount of data. Specifically, to enhance the model's ability to handle information from images and questions, we devise a dual-loop training framework: in the inner loop, specific instructions help the model process different types of questions, while in the outer loop, the model accumulates general knowledge across various question-answer pairs. Furthermore, to analyze the current question in detail and focus on the relevant visual features, we design Meta-Prompt Generation modules and Dynamic Routers that parse the input content and dynamically combine meta-prompts as required. Experimental results on standard LVQA datasets demonstrate the effectiveness of the proposed method compared with other approaches, with significant accuracy gains across various question-answer types.
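The dual-loop training and dynamic prompt routing described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming a MAML-style inner/outer loop and a softmax-weighted prompt pool; the module names, shapes, and hyperparameters (MetaPromptGenerator, DynamicRouter, dual_loop_step, prompt_len, inner_lr, the precomputed content_feat, and the loss-returning model callable) are assumptions made for illustration and are not taken from the paper.

# Hypothetical sketch of meta-prompt tuning with a dynamic router (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaPromptGenerator(nn.Module):
    """Holds a pool of learnable meta-prompt token banks."""
    def __init__(self, num_prompts: int = 8, prompt_len: int = 4, dim: int = 768):
        super().__init__()
        # Pool of candidate meta-prompts: (num_prompts, prompt_len, dim)
        self.prompt_pool = nn.Parameter(torch.randn(num_prompts, prompt_len, dim) * 0.02)

    def forward(self) -> torch.Tensor:
        return self.prompt_pool

class DynamicRouter(nn.Module):
    """Scores each meta-prompt against the fused question/image feature and
    combines them with a softmax-weighted sum (a simple soft-routing choice)."""
    def __init__(self, dim: int = 768, num_prompts: int = 8):
        super().__init__()
        self.scorer = nn.Linear(dim, num_prompts)

    def forward(self, content_feat: torch.Tensor, prompt_pool: torch.Tensor) -> torch.Tensor:
        # content_feat: (batch, dim) pooled representation of question + image
        weights = F.softmax(self.scorer(content_feat), dim=-1)        # (batch, num_prompts)
        # Weighted combination of prompts -> (batch, prompt_len, dim)
        return torch.einsum("bn,npd->bpd", weights, prompt_pool)

def dual_loop_step(model, generator, router, support_batch, query_batch,
                   inner_lr: float = 1e-2, inner_steps: int = 1):
    """One outer-loop update in a MAML-style dual loop (an assumption about how
    the inner and outer loops interact). `model(images, questions, prompts)` is
    assumed to return a scalar loss on the answer labels in the batch dicts."""
    # Inner loop: adapt a copy of the prompt pool to one question type (support set).
    adapted = {k: v.clone() for k, v in generator.named_parameters()}
    for _ in range(inner_steps):
        prompts = router(support_batch["content_feat"], adapted["prompt_pool"])
        loss = model(support_batch["images"], support_batch["questions"], prompts)
        grads = torch.autograd.grad(loss, list(adapted.values()), create_graph=True)
        adapted = {k: v - inner_lr * g for (k, v), g in zip(adapted.items(), grads)}

    # Outer loop: evaluate the adapted prompts on held-out question-answer pairs
    # (query set); the caller backpropagates this loss to the shared pool and router.
    prompts = router(query_batch["content_feat"], adapted["prompt_pool"])
    return model(query_batch["images"], query_batch["questions"], prompts)

In such a setup, the loss returned by dual_loop_step would be backpropagated by a standard optimizer to the shared prompt pool, the router, and any tuned VLM parameters, so that adaptation to a specific question type happens in the inner loop while general knowledge accumulates in the outer loop.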
Pages: 18