Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning

被引：18

作者：

Tu, Cheng-Hao ^{[1
]}

Mai, Zheda ^{[1
]}

Chao, Wei-Lun ^{[1
]}

机构：

[1] Ohio State Univ, Columbus, OH USA

来源：

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023年

关键词：

D O I：

10.1109/CVPR52729.2023.00746

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

Intermediate features of a pre-trained model have been shown informative for making accurate predictions on downstream tasks, even if the model backbone is kept frozen. The key challenge is how to utilize these intermediate features given their gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. Through introducing a handful of learnable "query" tokens to each layer, VQT leverages the inner workings of Transformers to "summarize" rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through the entire backbone. This also suggests the complementary role between VQT and those approaches in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches to attain even higher accuracy, making it a simple addon to further boost transfer learning. Code is available at https://github.com/andytu28/VQT.

引用

页码：7725 / 7735

页数：11

共 59 条

[1]

Argyriou A., 2006, Advances in neural information processing systems, V19

[2] ViViT: A Video Vision Transformer [J].

Arnab, Anurag ;

Dehghani, Mostafa ;

Heigold, Georg ;

Sun, Chen ;

Lucic, Mario ;

Schmid, Cordelia .

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :6816-6826

[3]

Asai Akari, 2022, ARXIV220511961

[4] Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks [J].

Bell, Sean ;

Zitnick, C. Lawrence ;

Bala, Kavita ;

Girshick, Ross .

2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :2874-2883

[5]

Ben-Zaken E, 2022, PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, P1

[6]

Brown Tom B., 2020, Language Models Are FewShot Learners., DOI [DOI 10.5555/3495724.3495883, 10.5555/3495724.3495883]

[7]

Carion N, 2020, Img Proc Comp Vis Re, V12346, P213, DOI 10.1007/978-3-030-58452-8_13

[8] Emerging Properties in Self-Supervised Vision Transformers [J].

Caron, Mathilde ;

Touvron, Hugo ;

Misra, Ishan ;

Jegou, Herve ;

Mairal, Julien ;

Bojanowski, Piotr ;

Joulin, Armand .

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :9630-9640

[9] Pre-Trained Image Processing Transformer [J].

Chen, Hanting ;

Wang, Yunhe ;

Guo, Tianyu ;

Xu, Chang ;

Deng, Yiping ;

Liu, Zhenhua ;

Ma, Siwei ;

Xu, Chunjing ;

Xu, Chao ;

Gao, Wen .

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, :12294-12305

[10]

Chen Shoufa, 2022, ARXIV220513535

← 1 2 3 4 5 6 →