Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space

Cited by: 42
Authors
Chavan, Arnav [1,3]
Shen, Zhiqiang [2,3]
Liu, Zhuang [4]
Liu, Zechun [5]
Cheng, Kwang-Ting [6]
Xing, Eric [2,3]
Affiliations
[1] IIT Dhanbad, Dhanbad, Jharkhand, India
[2] Carnegie Mellon University, Pittsburgh, PA, USA
[3] MBZUAI, Abu Dhabi, United Arab Emirates
[4] University of California, Berkeley, Berkeley, CA 94720, USA
[5] Meta Inc., Reality Labs, Menlo Park, CA, USA
[6] Hong Kong University of Science and Technology, Hong Kong, China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022
Keywords
DOI
10.1109/CVPR52688.2022.00488
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper explores the feasibility of finding an optimal sub-model within a vision transformer and introduces a pure vision transformer slimming (ViT-Slim) framework. It searches for a sub-structure of the original model end-to-end across multiple dimensions, including the input tokens and the MHSA and MLP modules, while achieving state-of-the-art performance. Our method is based on a learnable, unified ℓ1 sparsity constraint with pre-defined factors that reflect global importance in the continuous search space of the different dimensions. The search process is highly efficient thanks to a single-shot training scheme: on DeiT-S, for instance, ViT-Slim takes only ~43 GPU hours, and the searched structure is flexible, with diverse dimensionalities across modules. A budget threshold is then applied according to the accuracy-FLOPs trade-off required on the running device, and a retraining process produces the final model. Extensive experiments show that ViT-Slim can compress up to 40% of the parameters and 40% of the FLOPs of various vision transformers while increasing accuracy by ~0.6% on ImageNet. We also demonstrate the advantage of our searched models on several downstream datasets. Our code is available at https://github.com/Arnav0400/ViT-Slim.
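The abstract compresses the mechanism into a few sentences; the sketch below illustrates the core idea in PyTorch. It is not the authors' implementation (see the linked repository for that): the names SlimLinear, sparsity_loss, and budget_mask are illustrative, and the gating shown on a single linear layer stands in for the per-dimension importance factors that ViT-Slim attaches to tokens, MHSA, and MLP modules.

import torch
import torch.nn as nn

class SlimLinear(nn.Module):
    """Linear layer whose output channels are gated by learnable
    importance factors (one per channel). Hypothetical stand-in for
    the per-dimension factors described in the abstract."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Importance factors initialized to 1, i.e. the full model.
        self.alpha = nn.Parameter(torch.ones(out_features))

    def forward(self, x):
        # Gate each output channel by its importance factor.
        return self.linear(x) * self.alpha

def sparsity_loss(model, weight=1e-4):
    """l1 penalty on all importance factors; added to the task loss,
    it drives unimportant factors toward zero during the search."""
    return weight * sum(m.alpha.abs().sum()
                        for m in model.modules()
                        if isinstance(m, SlimLinear))

@torch.no_grad()
def budget_mask(model, keep_ratio=0.6):
    """After the single-shot search, keep the globally top-ranked
    factors up to the budget and zero out the rest (the
    'budget threshold' step)."""
    alphas = torch.cat([m.alpha.abs().flatten()
                        for m in model.modules()
                        if isinstance(m, SlimLinear)])
    k = max(1, int(keep_ratio * alphas.numel()))
    threshold = alphas.topk(k).values.min()
    for m in model.modules():
        if isinstance(m, SlimLinear):
            m.alpha.mul_((m.alpha.abs() >= threshold).float())

In training, the penalty is simply added to the task loss (loss = criterion(output, target) + sparsity_loss(model)); after the search converges, budget_mask zeroes the low-importance factors, the corresponding channels can be pruned, and the slimmed model is retrained, mirroring the search/threshold/retrain pipeline the abstract describes.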
Pages: 4921-4931
Number of pages: 11