CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Cited by: 231
Authors
Gao, Peng [1]
Geng, Shijie [2]
Zhang, Renrui [1]
Ma, Teli [1]
Fang, Rongyao [3]
Zhang, Yongfeng [1,2]
Li, Hongsheng [3]
Qiao, Yu [1]
Affiliations
[1] Shanghai AI Laboratory, Shanghai, China
[2] Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
[3] The Chinese University of Hong Kong, Hong Kong, China
Keywords
Feature adapter; Vision-language model; Few-shot learning; Open-vocabulary
DOI
10.1007/s11263-023-01891-x
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in Radford et al. (International conference on machine learning, PMLR, 2021) that directly learns to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al. in Int J Comput Vis 130(9):2337-2348, 2022) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models besides prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes lightweight feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pretrained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
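As a rough illustration of the mechanism the abstract describes, the PyTorch sketch below implements a bottleneck adapter with residual-style feature blending on top of frozen features. The feature dimension, reduction factor, and residual ratio alpha are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck layer: down-project, non-linearity, up-project.
    # `reduction` (an assumed hyperparameter) sets the bottleneck width.
    def __init__(self, dim: int = 1024, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

def blend(feat: torch.Tensor, adapter: Adapter, alpha: float = 0.2) -> torch.Tensor:
    # Residual-style blending: mix newly learned features with the
    # original pretrained features, weighted by the residual ratio alpha.
    return alpha * adapter(feat) + (1.0 - alpha) * feat

# Usage sketch: adapt frozen image features before similarity scoring.
image_feat = torch.randn(8, 1024)  # stand-in for frozen CLIP image features
image_feat = blend(image_feat, Adapter(dim=1024), alpha=0.2)
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

The same adapter can equally be attached to the text branch; the residual ratio alpha trades newly learned task knowledge against the frozen pretrained features.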
Pages: 581-595
Page count: 15