CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Cited by: 302
Authors
Gao, Peng [1]
Geng, Shijie [2]
Zhang, Renrui [1]
Ma, Teli [1]
Fang, Rongyao [3]
Zhang, Yongfeng [1,2]
Li, Hongsheng [3]
Qiao, Yu [1]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Keywords
Feature adapter; Vision-language model; Few-shot learning; Open-vocabulary
DOI
10.1007/s11263-023-01891-x
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in Radford et al. (International conference on machine learning, PMLR, 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al. in Int J Comput Vis 130(9):2337-2348, 2022) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models besides prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pretrained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
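The bottleneck adapter and residual-style feature blending described in the abstract can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the class and function names, the reduction factor of 4, the blending ratio alpha, and the feature dimensions are all choices made for the example.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project (assumed reduction of 4)."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


def blend(features: torch.Tensor, adapter: Adapter, alpha: float = 0.2) -> torch.Tensor:
    """Residual-style blending of adapted features with the original pretrained features.

    alpha is an assumed hyperparameter controlling how much the new features contribute.
    """
    adapted = adapter(features)
    return alpha * adapted + (1.0 - alpha) * features


# Usage sketch: apply the adapter on top of frozen CLIP image features, then
# classify against frozen text embeddings as in zero-shot CLIP.
if __name__ == "__main__":
    dim, batch, num_classes = 512, 8, 10          # assumed sizes for the example
    image_features = torch.randn(batch, dim)       # stand-in for frozen image-encoder output
    text_features = torch.randn(num_classes, dim)  # stand-in for class-name text embeddings

    adapter = Adapter(dim)                         # the only trainable parameters in this sketch
    feats = blend(image_features, adapter)

    feats = feats / feats.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * feats @ text_features.t()     # cosine-similarity logits
    print(logits.shape)                            # torch.Size([8, 10])
```

In this sketch only the adapter's two linear layers are trained on the few-shot examples, while the pretrained encoders stay frozen; the residual blend keeps the original features available when the adapted ones are uninformative.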
Pages: 581-595
Page count: 15
Related References (70 in total)
[1] Alayrac JB, 2022, ADV NEUR IN
[2] Anderson P, 2018, PROC CVPR IEEE, P6077, DOI 10.1109/CVPR.2018.00636
[3] Bossard L, 2014, LECT NOTES COMPUT SC, V8694, P446, DOI 10.1007/978-3-319-10599-4_29
[4] Brown TB, 2020, ADV NEUR IN, V33
[5] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers. COMPUTER VISION - ECCV 2020, PT I, 2020, V12346, P213-229
[6] Chen J, 2022, ADV NEUR IN
[7] Chen YC, 2020, ECCV
[8] Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing Textures in the Wild. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, P3606-3613
[9] Conneau A, 2020, P 58 ANN M ASS COMP, P8440, DOI 10.18653/v1/2020.acl-main.747
[10] Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848