Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Cited by: 42
Authors
Chavan, Arnav [1,3]
Shen, Zhiqiang [2,3]
Liu, Zhuang [4]
Liu, Zechun [5]
Cheng, Kwang-Ting [6]
Xing, Eric [2,3]
Affiliations
[1] IIT Dhanbad, Dhanbad, Jharkhand, India
[2] Carnegie Mellon University, Pittsburgh, PA, USA
[3] MBZUAI, Abu Dhabi, United Arab Emirates
[4] University of California, Berkeley, Berkeley, CA, USA
[5] Meta Reality Labs, Menlo Park, CA, USA
[6] HKUST, Hong Kong, China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.00488
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper explores the feasibility of finding an optimal sub-model within a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It searches for a sub-structure of the original model end-to-end across multiple dimensions, including the input tokens and the MHSA and MLP modules, while achieving state-of-the-art performance. Our method is based on a learnable, unified ℓ1 sparsity constraint with pre-defined factors that reflect the global importance of different dimensions in the continuous search space. The search is highly efficient thanks to a single-shot training scheme: on DeiT-S, for instance, ViT-Slim takes only ~43 GPU hours, and the searched structure is flexible, with diverse dimensionalities across modules. A budget threshold is then applied according to the accuracy-FLOPs trade-off required by the target device, and a retraining step produces the final model. Extensive experiments show that ViT-Slim can compress up to 40% of the parameters and 40% of the FLOPs of various vision transformers while increasing accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
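The search procedure the abstract describes can be illustrated roughly as follows. This is a minimal PyTorch sketch of the general idea (learnable soft masks over prunable dimensions, a unified ℓ1 penalty during a single-shot search, then thresholding to a budget), not the authors' released implementation; the names SoftMask, l1_sparsity_loss, and binarize_by_budget and the penalty weight are hypothetical — see the linked repository for the actual code.

```python
# Minimal sketch of the idea behind ViT-Slim's search (not the authors'
# implementation): learnable soft masks over prunable dimensions, a unified
# L1 penalty during a single-shot search, then budget-based thresholding.
import torch
import torch.nn as nn


class SoftMask(nn.Module):
    """Learnable gate over one prunable dimension (e.g., MLP hidden units)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))  # start fully "on"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale features along the last dimension by their gate values.
        return x * self.gate


def l1_sparsity_loss(masks, weight=1e-4):
    """Unified L1 penalty over all masks; `weight` is a hypothetical factor."""
    return weight * sum(m.gate.abs().sum() for m in masks)


def binarize_by_budget(mask: SoftMask, keep_ratio: float) -> torch.Tensor:
    """Keep the top-k gates by magnitude to meet a parameter/FLOPs budget."""
    k = max(1, int(keep_ratio * mask.gate.numel()))
    idx = mask.gate.abs().topk(k).indices
    keep = torch.zeros_like(mask.gate, dtype=torch.bool)
    keep[idx] = True
    return keep  # boolean mask selecting the surviving sub-structure


# Usage sketch: during the search phase, add l1_sparsity_loss(masks) to the
# task loss and train once ("single-shot"); afterwards, call
# binarize_by_budget on each mask and retrain the resulting sub-model.
```

In this reading, the gate magnitudes play the role of the globally comparable importance scores, and the keep_ratio threshold corresponds to the budget chosen for the target accuracy-FLOPs trade-off.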
Pages: 4921-4931
Page count: 11