Visual Prompt Tuning

Times Cited: 768
Authors
Jia, Menglin [1 ,2 ]
Tang, Luming [1 ]
Chen, Bor-Chun [2 ]
Cardie, Claire [1 ]
Belongie, Serge [3 ]
Hariharan, Bharath [1 ]
Lim, Ser-Nam [2 ]
Affiliations
[1] Cornell Univ, Ithaca, NY 14850 USA
[2] Meta AI, New York, NY 10003 USA
[3] Univ Copenhagen, Copenhagen, Denmark
Source
COMPUTER VISION - ECCV 2022, PT XXXIII | 2022 / Vol. 13693
DOI
10.1007/978-3-031-19827-4_41
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost. Code is available at github.com/kmnp/vpt.
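As a rough illustration of the mechanism described in the abstract, the sketch below prepends learnable prompt tokens to the token sequence of a frozen ViT-style backbone and trains only those prompts plus a linear head (the shallow prompting setting). This is a minimal sketch, not the authors' released code: the wrapper name PromptedViT, the num_prompts default, and the assumption of a timm-style backbone exposing patch_embed, cls_token, pos_embed, blocks, and norm are all illustrative; see github.com/kmnp/vpt for the official implementation.

```python
# Minimal sketch of the idea in the abstract: learnable prompt tokens are
# inserted in the input (token) space of a frozen Transformer backbone, and
# only the prompts and a task head are trained. Names and defaults here are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PromptedViT(nn.Module):
    """Wraps a frozen ViT-style backbone; trains only prompt tokens + head."""

    def __init__(self, vit, num_prompts=10, num_classes=100):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad = False          # keep the pre-trained backbone frozen
        dim = vit.embed_dim
        # learnable prompt tokens (typically <1% of the backbone's parameters)
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, dim))
        nn.init.uniform_(self.prompts, -0.1, 0.1)
        self.head = nn.Linear(dim, num_classes)  # task-specific classification head

    def forward(self, x):
        B = x.shape[0]
        x = self.vit.patch_embed(x)                           # (B, N, D) patch tokens
        cls = self.vit.cls_token.expand(B, -1, -1)            # (B, 1, D) class token
        x = torch.cat([cls, x], dim=1) + self.vit.pos_embed   # backbone's positional embedding
        prompts = self.prompts.expand(B, -1, -1)              # (B, P, D) learned prompts
        x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)   # insert prompts after [CLS]
        for blk in self.vit.blocks:                           # frozen Transformer layers
            x = blk(x)
        x = self.vit.norm(x)
        return self.head(x[:, 0])                             # predict from the [CLS] token
```

Under these assumptions, a backbone such as a timm VisionTransformer could be wrapped directly, with only `model.prompts` and `model.head` receiving gradient updates during fine-tuning.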
Pages: 709-727
Number of Pages: 19