Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Cited by: 42
Authors
Chavan, Arnav [1,3]
Shen, Zhiqiang [2,3]
Liu, Zhuang [4]
Liu, Zechun [5]
Cheng, Kwang-Ting [6]
Xing, Eric [2,3]
Affiliations
[1] IIT Dhanbad, Dhanbad, Jharkhand, India
[2] Carnegie Mellon University, Pittsburgh, PA, USA
[3] MBZUAI, Abu Dhabi, United Arab Emirates
[4] University of California, Berkeley, Berkeley, CA 94720, USA
[5] Meta Inc., Reality Labs, Menlo Park, CA, USA
[6] Hong Kong University of Science and Technology, Hong Kong, China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Keywords
DOI
10.1109/CVPR52688.2022.00488
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper explores the feasibility of finding an optimal sub-model within a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It searches for a sub-structure of the original model end-to-end across multiple dimensions, including the input tokens and the MHSA and MLP modules, while achieving state-of-the-art performance. Our method is based on a learnable, unified ℓ1 sparsity constraint with pre-defined factors that reflect global importance in the continuous search space of the different dimensions. The search process is highly efficient thanks to a single-shot training scheme: on DeiT-S, for instance, ViT-Slim takes only ~43 GPU hours, and the searched structure is flexible, with diverse dimensionalities across modules. A budget threshold is then applied according to the accuracy-FLOPs trade-off required on the running device, and a retraining process produces the final model. Extensive experiments show that ViT-Slim can compress up to 40% of the parameters and 40% of the FLOPs of various vision transformers while increasing accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
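The abstract compresses the mechanism into a few sentences; the sketch below illustrates the core idea in PyTorch. It is not the authors' implementation (see the linked repository for that): the names SlimLinear, sparsity_loss, and budget_mask are illustrative, and the gating shown on a single linear layer stands in for the per-dimension importance factors that ViT-Slim attaches to tokens, MHSA, and MLP modules.

import torch
import torch.nn as nn

class SlimLinear(nn.Module):
    """Linear layer whose output channels are gated by learnable
    importance factors (one per channel). Hypothetical stand-in for
    the per-dimension factors described in the abstract."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Importance factors initialized to 1, i.e. the full model.
        self.alpha = nn.Parameter(torch.ones(out_features))

    def forward(self, x):
        # Gate each output channel by its importance factor.
        return self.linear(x) * self.alpha

def sparsity_loss(model, weight=1e-4):
    """l1 penalty on all importance factors; added to the task loss,
    it drives unimportant factors toward zero during the search."""
    return weight * sum(m.alpha.abs().sum()
                        for m in model.modules()
                        if isinstance(m, SlimLinear))

@torch.no_grad()
def budget_mask(model, keep_ratio=0.6):
    """After the single-shot search, keep the globally top-ranked
    factors up to the budget and zero out the rest (the
    'budget threshold' step)."""
    alphas = torch.cat([m.alpha.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, SlimLinear)])
    k = max(1, int(keep_ratio * alphas.numel()))
    threshold = alphas.topk(k).values.min()
    for m in model.modules():
        if isinstance(m, SlimLinear):
            m.alpha.mul_((m.alpha.abs() >= threshold).float())

In training, the penalty is simply added to the task loss (loss = criterion(output, target) + sparsity_loss(model)); after the search converges, budget_mask zeroes the low-importance factors, the corresponding channels can be pruned, and the slimmed model is retrained, mirroring the search/threshold/retrain pipeline the abstract describes.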
Pages: 4921-4931
Number of pages: 11