Learning to Prompt for Vision-Language Models

Cited by: 875
Authors
Zhou, Kaiyang [1]
Yang, Jingkang [1]
Loy, Chen Change [1]
Liu, Ziwei [1]
Affiliations
[1] Nanyang Technological University, S-Lab, Singapore
Keywords
Computational linguistics; Learning systems; Natural language processing systems; Zero-shot learning
DOI
10.1007/s11263-022-01653-1
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Unlike traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while keeping the entire set of pre-trained parameters fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
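The abstract describes the mechanism precisely enough to sketch: context words are replaced by learnable vectors, prepended to each class name's token embedding, and optimized by backpropagating through the frozen text encoder. The PyTorch snippet below is a minimal sketch under stated assumptions, not the authors' released implementation; the linear-layer text encoder, random class-name embeddings, and dimensions are placeholders standing in for CLIP's frozen components.

```python
# Minimal sketch of CoOp-style context optimization (unified context).
# Assumptions (not from the paper's released code): a frozen linear layer
# stands in for CLIP's text transformer, and class-name token embeddings
# are random placeholders. Only the context vectors `ctx` are trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, n_classes: int, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Learnable context tokens, shared by all classes (unified context).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen placeholder for each class name's token embedding.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self) -> torch.Tensor:
        n_classes = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Each prompt = [context tokens][class-name token].
        return torch.cat([ctx, self.cls_emb], dim=1)  # (n_classes, n_ctx + 1, dim)

# Frozen stand-in for CLIP's text encoder; gradients still flow *through* it.
text_encoder = nn.Linear(512, 512)
for p in text_encoder.parameters():
    p.requires_grad_(False)

prompt_learner = PromptLearner(n_classes=10)
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=0.002)

# Pretend these came from CLIP's frozen image encoder on a few-shot batch.
image_feats = F.normalize(torch.randn(32, 512), dim=-1)
labels = torch.randint(0, 10, (32,))

prompts = prompt_learner()
text_feats = F.normalize(text_encoder(prompts).mean(dim=1), dim=-1)
logits = 100.0 * image_feats @ text_feats.t()  # scaled cosine similarities
loss = F.cross_entropy(logits, labels)
loss.backward()   # gradients reach only `ctx`; encoder weights stay fixed
optimizer.step()
```

For the class-specific context variant mentioned in the abstract, `ctx` would instead have shape `(n_classes, n_ctx, dim)` so that each class learns its own context tokens; everything else, including the frozen encoders, stays the same.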
Pages: 2337-2348
Page count: 12
Related Papers
50 in total
  • [1] Learning to Prompt for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    International Journal of Computer Vision, 2022, 130 : 2337 - 2348
  • [2] Conditional Prompt Learning for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16795 - 16804
  • [3] Consistent prompt learning for vision-language models
    Zhang, Yonggang
    Tian, Xinmei
    KNOWLEDGE-BASED SYSTEMS, 2025, 310
  • [4] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [5] JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
    Guo, Yuncheng
    Guo, Xiaodong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 28695 - 28705
  • [6] Adversarial Prompt Tuning for Vision-Language Models
    Zhang, Jiaming
    Ma, Xingjun
    Wang, Xin
    Qiu, Lingyu
    Wang, Jiaqi
    Jiang, Yu-Gang
    Sang, Jitao
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 56 - 72
  • [7] Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models
    Wang, Yubin
    Jiang, Xinyang
    Cheng, De
    Li, Dongsheng
    Zhao, Cairong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5749 - 5757
  • [8] Concept-Guided Prompt Learning for Generalization in Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Yu, Ke
    Tang, Yushun
    He, Zhihai
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 7377 - 7386
  • [9] Learning to Prompt for Vision-Language Emotion Recognition
    Xie, Hongxia
    Chung, Hua
    Shuai, Hong-Han
    Cheng, Wen-Huang
    2023 11TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS, ACIIW, 2023
  • [10] Task-to-Instance Prompt Learning for Vision-Language Models at Test Time
    Lu, Zhihe
    Bai, Jiawang
    Li, Xin
    Xiao, Zeyu
    Wang, Xinchao
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 1908 - 1920