From Images to Textual Prompts: Zero-shot Visual Question Answering with Frozen Large Language Models

Cited by: 67
Authors
Guo, Jiaxian [1 ]
Li, Junnan [2 ]
Li, Dongxu [2 ]
Tiong, Anthony Meng Huat [2 ,3 ]
Li, Boyang [3 ]
Tao, Dacheng [1 ]
Hoi, Steven [2 ]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] Salesforce Res, San Francisco, CA USA
[3] Nanyang Technol Univ, Singapore, Singapore
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
Australian Research Council; National Research Foundation, Singapore;
DOI
10.1109/CVPR52729.2023.01046
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question answering (VQA) remains challenging, primarily due to the modality disconnect and task disconnect between the LLM and VQA tasks. End-to-end training on multimodal data may bridge the disconnects, but is inflexible and computationally expensive. To address this issue, we propose Img2LLM, a plug-and-play module that provides LLM prompts to enable LLMs to perform zero-shot VQA tasks without end-to-end training. We develop LLM-agnostic models that describe image content as exemplar question-answer pairs, which prove to be effective LLM prompts. Img2LLM offers the following benefits: 1) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo [3] by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method outperforms few-shot methods by as much as 20%. 2) It flexibly interfaces with a wide range of LLMs to perform VQA. 3) It eliminates the need to specialize LLMs via end-to-end finetuning and to serve highly specialized LLMs to end users, thereby reducing cost. Code is available via the LAVIS [28] framework at https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa.
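The abstract's core mechanism is converting image content into exemplar question-answer pairs that a frozen, text-only LLM can consume as a prompt. Below is a minimal sketch of what such prompt assembly might look like; the function name, prompt format, and exemplars are illustrative assumptions, not the paper's actual implementation (see the linked LAVIS code for that).

```python
def build_img2llm_prompt(exemplar_qa_pairs, question):
    """Assemble a zero-shot VQA prompt from synthetic exemplar QA pairs.

    The exemplars (hypothetically produced by captioning/question-generation
    models run on the image) stand in for the image content, so a frozen
    text-only LLM can answer `question` without ever seeing pixels.
    """
    lines = []
    for q, a in exemplar_qa_pairs:
        # In-context exemplars describing the image as QA pairs.
        lines.append(f"Question: {q} Answer: {a}")
    # The actual VQA question, left open for the LLM to complete.
    lines.append(f"Question: {question} Answer:")
    return "\n".join(lines)
```

The resulting string would then be sent to any text-only LLM as-is, which is what makes the module "plug-and-play": no weights of either the vision models or the LLM need to change.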
Pages: 10867-10877 (11 pages)
References
67 entries in total
[1]   Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering [J].
Agrawal, Aishwarya ;
Batra, Dhruv ;
Parikh, Devi ;
Kembhavi, Aniruddha .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4971-4980
[2]  
Akula AR, 2021, 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), P2148
[3]  
Alayrac Jean-Baptiste, 2022, arXiv:2204.14198
[4]  
Anderson P, 2018, PROC CVPR IEEE, P6077, DOI 10.1109/CVPR.2018.00636
[5]  
[Anonymous], PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION
[6]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[7]  
Banerjee P, 2021, FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, P3420
[8]  
Black Sid, 2021, GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
[9]  
Brown TB, 2020, ADV NEUR IN, V33
[10]  
Changpinyo Soravit, 2022, North American Chapter of the Association for Computational Linguistics, V1