Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Cited by: 42
Authors
Chavan, Arnav [1,3]
Shen, Zhiqiang [2,3]
Liu, Zhuang [4]
Liu, Zechun [5]
Cheng, Kwang-Ting [6]
Xing, Eric [2,3]
Affiliations
[1] IIT Dhanbad, Dhanbad, Jharkhand, India
[2] Carnegie Mellon University, Pittsburgh, PA, USA
[3] MBZUAI, Abu Dhabi, United Arab Emirates
[4] University of California, Berkeley, Berkeley, CA, USA
[5] Meta Reality Labs, Menlo Park, CA, USA
[6] HKUST, Hong Kong, China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00488
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper explores the feasibility of finding an optimal sub-model within a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It searches for a sub-structure of the original model end-to-end across multiple dimensions, including the input tokens and the MHSA and MLP modules, while achieving state-of-the-art performance. Our method is based on a learnable, unified ℓ1 sparsity constraint with pre-defined factors that reflect the global importance of different dimensions in the continuous search space. The search is highly efficient thanks to a single-shot training scheme: on DeiT-S, for instance, ViT-Slim takes only ~43 GPU hours, and the searched structure is flexible, with diverse dimensionalities across modules. A budget threshold is then applied according to the accuracy-FLOPs trade-off required by the target device, and a retraining step produces the final model. Extensive experiments show that ViT-Slim can compress up to 40% of the parameters and 40% of the FLOPs of various vision transformers while increasing accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
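The search procedure the abstract describes can be illustrated roughly as follows. This is a minimal PyTorch sketch of the general idea (learnable soft masks over prunable dimensions, a unified ℓ1 penalty during a single-shot search, then thresholding to a budget), not the authors' released implementation; the names SoftMask, l1_sparsity_loss, and binarize_by_budget and the penalty weight are hypothetical — see the linked repository for the actual code.

```python
# Minimal sketch of the idea behind ViT-Slim's search (not the authors'
# implementation): learnable soft masks over prunable dimensions, a unified
# L1 penalty during a single-shot search, then budget-based thresholding.
import torch
import torch.nn as nn


class SoftMask(nn.Module):
    """Learnable gate over one prunable dimension (e.g., MLP hidden units)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))  # start fully "on"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale features along the last dimension by their gate values.
        return x * self.gate


def l1_sparsity_loss(masks, weight=1e-4):
    """Unified L1 penalty over all masks; `weight` is a hypothetical factor."""
    return weight * sum(m.gate.abs().sum() for m in masks)


def binarize_by_budget(mask: SoftMask, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k gates by magnitude to meet a parameter/FLOPs budget."""
    k = max(1, int(keep_ratio * mask.gate.numel()))
    idx = mask.gate.abs().topk(k).indices
    keep = torch.zeros_like(mask.gate, dtype=torch.bool)
    keep[idx] = True
    return keep  # boolean mask selecting the surviving sub-structure


# Usage sketch: during the search phase, add l1_sparsity_loss(masks) to the
# task loss and train once ("single-shot"); afterwards, call
# binarize_by_budget on each mask and retrain the resulting sub-model.
```

In this reading, the gate magnitudes play the role of the globally comparable importance scores, and the keep_ratio threshold corresponds to the budget chosen for the target accuracy-FLOPs trade-off.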
Pages: 4921-4931
Page count: 11