Visual Prompt Tuning

Times Cited: 768
Authors
Jia, Menglin [1 ,2 ]
Tang, Luming [1 ]
Chen, Bor-Chun [2 ]
Cardie, Claire [1 ]
Belongie, Serge [3 ]
Hariharan, Bharath [1 ]
Lim, Ser-Nam [2 ]
Affiliations
[1] Cornell Univ, Ithaca, NY 14850 USA
[2] Meta AI, New York, NY 10003 USA
[3] Univ Copenhagen, Copenhagen, Denmark
Source
COMPUTER VISION - ECCV 2022, PT XXXIII | 2022 / Vol. 13693
DOI
10.1007/978-3-031-19827-4_41
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost. Code is available at github.com/kmnp/vpt.
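As a rough illustration of the mechanism described in the abstract, the sketch below prepends learnable prompt tokens to the token sequence of a frozen ViT-style backbone and trains only those prompts plus a linear head (the shallow prompting setting). This is a minimal sketch, not the authors' released code: the wrapper name PromptedViT, the num_prompts default, and the assumption of a timm-style backbone exposing patch_embed, cls_token, pos_embed, blocks, and norm are all illustrative; see github.com/kmnp/vpt for the official implementation.

```python
# Minimal sketch of the idea in the abstract: learnable prompt tokens are
# inserted in the input (token) space of a frozen Transformer backbone, and
# only the prompts and a task head are trained. Names and defaults here are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PromptedViT(nn.Module):
    """Wraps a frozen ViT-style backbone; trains only prompt tokens + head."""

    def __init__(self, vit, num_prompts=10, num_classes=100):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False          # keep the pre-trained backbone frozen
        dim = vit.embed_dim
        # learnable prompt tokens (typically <1% of the backbone's parameters)
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(dim, num_classes)  # task-specific classification head

    def forward(self, x):
        B = x.shape[0]
        x = self.vit.patch_embed(x)                           # (B, N, D) patch tokens
        cls = self.vit.cls_token.expand(B, -1, -1)            # (B, 1, D) class token
        x = torch.cat([cls, x], dim=1) + self.vit.pos_embed   # backbone's positional embedding
        prompts = self.prompts.expand(B, -1, -1)              # (B, P, D) learned prompts
        x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)   # insert prompts after [CLS]
        for blk in self.vit.blocks:                           # frozen Transformer layers
            x = blk(x)
        x = self.vit.norm(x)
        return self.head(x[:, 0])                             # predict from the [CLS] token
```

Under these assumptions, a backbone such as a timm VisionTransformer could be wrapped directly, with only `model.prompts` and `model.head` receiving gradient updates during fine-tuning.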
Pages: 709-727
Number of Pages: 19