Lv-Adapter: Adapting Vision Transformers for Visual Classification with Linear-layers and Vectors

Cited by: 0

Authors:

Xu, Guangyi [1]

Ye, Junyong [1]

Liu, Xinyuan [1]

Wen, Xubin [1]

Li, Youwei [1]

Wang, Jingjing [1]

Affiliations:

[1] Chongqing Univ, Minist Educ, Key Lab Optoelect Technol & Syst, Chongqing, Peoples R China

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2024 / Vol. 246

Keywords:

Deep learning; Vision Transformers; Fine-tuning; Plug and play; Transfer learning;

DOI:

10.1016/j.cviu.2024.104049

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large pre-trained models based on Vision Transformers (ViTs) contain up to billions of parameters and demand substantial computational resources and storage space, which restricts their transferability across tasks. Recent approaches use adapter fine-tuning to address this drawback, but they still leave room for improvement in both the number of tunable parameters and accuracy. To address this, we propose an adapter fine-tuning module called Lv-Adapter, which consists of a linear layer and a vector. The module enables targeted parameter fine-tuning of pre-trained models by learning both the prior knowledge of the pre-training task and the information specific to the downstream task, so that the model adapts to a variety of downstream image and video tasks during transfer learning. Compared with full fine-tuning, Lv-Adapter has several appealing advantages. First, by adding only about 3% extra parameters to ViT, Lv-Adapter achieves accuracy comparable to full fine-tuning and even significantly surpasses it on action recognition benchmarks. Second, Lv-Adapter is a lightweight module that, owing to its simplicity, can be used plug-and-play in different transformer models. Finally, to validate these claims, extensive experiments were conducted on five image and video datasets, providing evidence for the effectiveness of Lv-Adapter. When only 3.5% extra parameters are updated, it achieves relative boosts of about 13% and 24% over the fully fine-tuned model on SSv2 and HMDB51, respectively.
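To make the abstract's two kinds of figures concrete, the following is a minimal arithmetic sketch, not the authors' code. It assumes a ViT-B/16-sized backbone of roughly 86 million parameters (the abstract does not name the backbone size), and the accuracy values are placeholders chosen only to illustrate how a relative boost is computed; they are not results from the paper. The function names are hypothetical.

```python
def extra_parameters(backbone_params: int, extra_fraction: float) -> int:
    """Added trainable parameters for a given fraction of the backbone size."""
    return round(backbone_params * extra_fraction)


def relative_boost(tuned_acc: float, baseline_acc: float) -> float:
    """Relative improvement of tuned_acc over baseline_acc, as a fraction."""
    return (tuned_acc - baseline_acc) / baseline_acc


# Assumption: a ViT-B/16-sized backbone with roughly 86 million parameters.
VIT_B_PARAMS = 86_000_000

added = extra_parameters(VIT_B_PARAMS, 0.035)
print(f"3.5% of the backbone = {added:,} extra trainable parameters")

# Placeholder accuracies for illustration only, NOT results from the paper.
baseline_acc, tuned_acc = 50.0, 62.0
print(f"relative boost = {relative_boost(tuned_acc, baseline_acc):.0%}")
```

Under these assumptions the adapter budget is a few million parameters, compared with updating the whole backbone in full fine-tuning. A relative boost is measured against the baseline accuracy, so it is not the same as a gain in absolute percentage points.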

Pages: 12
