CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Cited by: 302
Authors
Gao, Peng [1]
Geng, Shijie [2]
Zhang, Renrui [1]
Ma, Teli [1]
Fang, Rongyao [3]
Zhang, Yongfeng [1,2]
Li, Hongsheng [3]
Qiao, Yu [1]
Affiliations
[1] Shanghai AI Lab, Shanghai, Peoples R China
[2] Rutgers State Univ, New Brunswick, NJ USA
[3] Chinese Univ Hong Kong, Hong Kong, Peoples R China
Keywords
Feature adapter; Vision-language model; Few-shot learning; Open-vocabulary
DOI
10.1007/s11263-023-01891-x
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Large-scale contrastive vision-language pretraining has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in Radford et al. (International conference on machine learning, PMLR, 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (Zhou et al. in Int J Comput Vis 130(9):2337-2348, 2022) has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models besides prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which fine-tunes feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pretrained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
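The bottleneck adapter and residual-style feature blending described in the abstract can be summarized in a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the class and function names, the reduction factor of 4, the blending ratio alpha, and the feature dimensions are all choices made for the example.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project (assumed reduction of 4)."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


def blend(features: torch.Tensor, adapter: Adapter, alpha: float = 0.2) -> torch.Tensor:
    """Residual-style blending of adapted features with the original pretrained features.

    alpha is an assumed hyperparameter controlling how much the new features contribute.
    """
    adapted = adapter(features)
    return alpha * adapted + (1.0 - alpha) * features


# Usage sketch: apply the adapter on top of frozen CLIP image features, then
# classify against frozen text embeddings as in zero-shot CLIP.
if __name__ == "__main__":
    dim, batch, num_classes = 512, 8, 10          # assumed sizes for the example
    image_features = torch.randn(batch, dim)       # stand-in for frozen image-encoder output
    text_features = torch.randn(num_classes, dim)  # stand-in for class-name text embeddings

    adapter = Adapter(dim)                         # the only trainable parameters in this sketch
    feats = blend(image_features, adapter)

    feats = feats / feats.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * feats @ text_features.t()     # cosine-similarity logits
    print(logits.shape)                            # torch.Size([8, 10])
```

In this sketch only the adapter's two linear layers are trained on the few-shot examples, while the pretrained encoders stay frozen; the residual blend keeps the original features available when the adapted ones are uninformative.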
Pages: 581-595
Page count: 15
Related References (70 in total)
[1] Alayrac JB, 2022, ADV NEUR IN
[2] Anderson P, 2018, PROC CVPR IEEE, P6077, DOI 10.1109/CVPR.2018.00636
[3] Bossard L, 2014, LECT NOTES COMPUT SC, V8694, P446, DOI 10.1007/978-3-319-10599-4_29
[4] Brown TB, 2020, ADV NEUR IN, V33
[5] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers. COMPUTER VISION - ECCV 2020, PT I, 2020, V12346, P213-229
[6] Chen J, 2022, ADV NEUR IN
[7] Chen YC, 2020, ECCV
[8] Cimpoi M, Maji S, Kokkinos I, Mohamed S, Vedaldi A. Describing Textures in the Wild. 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014, P3606-3613
[9] Conneau A, 2020, P 58 ANN M ASS COMP, P8440, DOI 10.18653/v1/2020.acl-main.747
[10] Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848