From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models

Cited by: 67
Authors
Guo, Jiaxian [1 ]
Li, Junnan [2 ]
Li, Dongxu [2 ]
Tiong, Anthony Meng Huat [2 ,3 ]
Li, Boyang [3 ]
Tao, Dacheng [1 ]
Hoi, Steven [2 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] Salesforce Res, San Francisco, CA USA
[3] Nanyang Technol Univ, Singapore, Singapore
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
Australian Research Council; National Research Foundation, Singapore;
DOI
10.1109/CVPR52729.2023.01046
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and VQA tasks. End-to-end training on multimodal data may bridge the disconnects, but is inflexible and computationally expensive. To address this issue, we propose Img2LLM, a plug-and-play module that provides LLM prompts to enable LLMs to perform zero-shot VQA tasks without end-to-end training. We develop LLM-agnostic models that describe image content as exemplar question-answer pairs, which prove to be effective LLM prompts. Img2LLM offers the following benefits: 1) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo [3] by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method outperforms few-shot methods by as much as 20%. 2) It flexibly interfaces with a wide range of LLMs to perform VQA. 3) It eliminates the need to specialize LLMs via end-to-end finetuning and to serve highly specialized LLMs to end users, thereby reducing cost. Code is available via the LAVIS [28] framework at https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa.
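The abstract's core mechanism is converting image content into exemplar question-answer pairs that a frozen, text-only LLM can consume as a prompt. Below is a minimal sketch of what such prompt assembly might look like; the function name, prompt format, and exemplars are illustrative assumptions, not the paper's actual implementation (see the linked LAVIS code for that).

```python
def build_img2llm_prompt(exemplar_qa_pairs, question):
    """Assemble a zero-shot VQA prompt from synthetic exemplar QA pairs.

    The exemplars (hypothetically produced by captioning/question-generation
    models run on the image) stand in for the image content, so a frozen
    text-only LLM can answer `question` without ever seeing pixels.
    """
    lines = []
    for q, a in exemplar_qa_pairs:
        # In-context exemplars describing the image as QA pairs.
        lines.append(f"Question: {q} Answer: {a}")
    # The actual VQA question, left open for the LLM to complete.
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)
```

The resulting string would then be sent to any text-only LLM as-is, which is what makes the module "plug-and-play": no weights of either the vision models or the LLM need to change.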
Pages: 10867-10877 (11 pages)
References
67 entries in total
[1]   Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering [J].
Agrawal, Aishwarya ;
Batra, Dhruv ;
Parikh, Devi ;
Kembhavi, Aniruddha .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4971-4980
[2]  
Akula AR, 2021, 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), P2148
[3]  
Alayrac Jean-Baptiste, 2022, arXiv:2204.14198
[4]  
Anderson P, 2018, PROC CVPR IEEE, P6077, DOI 10.1109/CVPR.2018.00636
[5]  
[Anonymous], PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION
[6]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[7]  
Banerjee P, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, P3420
[8]  
Black Sid, 2021, GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
[9]  
Brown TB, 2020, ADV NEUR IN, V33
[10]  
Changpinyo Soravit, 2022, North American Chapter of the Association for Computational Linguistics, V1