ZPVQA: Visual Question Answering of Images Based on Zero-Shot Prompt Learning

Cited by: 1
Authors
Hu, Naihao [1 ]
Zhang, Xiaodan [1 ,2 ]
Zhang, Qiyuan [1 ]
Huo, Wei [1 ]
You, Shaojie [1 ]
Affiliations
[1] Qinghai Univ, Dept Comp Technol & Applicat, Xining 810016, Qinghai, Peoples R China
[2] Qinghai Univ, Qinghai Prov Lab Intelligent Comp & Applicat, Xining 810016, Qinghai, Peoples R China
Keywords
Visualization; Training; Data models; Context modeling; Transformers; Question answering (information retrieval); Linguistics; Cognition; Learning systems; Zero-shot learning; prompt learning; visual question answering; large language models
DOI
10.1109/ACCESS.2025.3550942
CLC number
TP [Automation technology, computer technology];
Discipline code
0812;
Abstract
In recent years, zero-shot learning has become a common strategy for visual question answering (VQA), addressing the challenge of complex interactions between the visual and linguistic modalities. Despite the significant progress of large language models (LLMs) on language tasks, applying them to VQA remains difficult because of the gap between visual and textual data. To alleviate this problem, we propose the zero-shot prompt learning VQA (ZPVQA) model, which bridges the gap between visual and textual data through a prompt-based reasoning strategy and reduces the dependence on end-to-end training. The model devises a prompt-learning method in which designed prompts guide LLMs to generate caption prompts from images; the images are then combined with the generated caption prompts so that the LLMs can produce questions and answers related to the images. We evaluated the ZPVQA model on multiple datasets and achieved performance improvements of 3.4% on the VQAv2 dataset and 2.6% on the OK-VQA dataset. The experimental results demonstrate that the proposed prompt-learning mechanism improves the model's performance on complex multi-modal tasks.
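The two-stage pipeline the abstract describes (caption generation first, then caption-conditioned answering) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, prompt wording, and the `llm` callable interface are all assumptions.

```python
# Hypothetical sketch of a two-stage zero-shot VQA prompting pipeline
# as described in the abstract. Prompt wording and interfaces are
# illustrative assumptions, not the paper's actual prompts.

def build_caption_prompt() -> str:
    """Prompt instructing the model to describe the input image."""
    return "Describe the content of this image in one sentence."

def build_qa_prompt(caption: str, question: str) -> str:
    """Combine the generated caption with the question for zero-shot answering."""
    return (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer the question using only the description above. Answer:"
    )

def zpvqa_answer(image, question: str, llm) -> str:
    """Two-stage inference: (1) caption the image, (2) answer from the caption.

    `llm` is any callable taking (prompt, image=None) and returning text;
    in practice this would be a multimodal captioner plus a text-only LLM.
    """
    caption = llm(build_caption_prompt(), image=image)
    return llm(build_qa_prompt(caption, question))
```

Because the image only enters the pipeline through the generated caption, the answering stage needs no end-to-end vision-language training, which is the dependence the abstract says the approach reduces.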
Pages: 50849-50859
Page count: 11
References
42 total
[1]  
Alayrac JB, 2022, ADV NEUR IN
[2]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[3]  
Brown TB, 2020, Arxiv, DOI [arXiv:2005.14165, DOI 10.48550/ARXIV.2005.14165]
[4]  
Banerjee P, 2021, Arxiv, DOI arXiv:2012.02356
[5]  
Black S., 2021, Softw., Metadata, V58
[6]  
Bommasani R., 2021, arXiv
[7]  
Changpinyo S, 2022, Arxiv, DOI arXiv:2205.01883
[8]   Prompt-RSVQA: Prompting visual context to a language model for Remote Sensing Visual Question Answering [J].
Chappuis, Christel ;
Zermatten, Valerie ;
Lobry, Sylvain ;
Le Saux, Bertrand ;
Tuia, Devis .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, :1371-1380
[9]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[10]  
Du YF, 2023, Arxiv, DOI [arXiv:2305.17006, 10.48550/arXiv.2305.17006]