DynaSlim: Dynamic Slimming for Vision Transformers

Cited by: 0
Authors
Shi, Da [1 ]
Gao, Jingsheng [1 ]
Liu, Ting [1 ]
Fu, Yuzhuo [1 ]
Affiliations
[1] Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai, People's Republic of China
Source
2023 IEEE International Conference on Multimedia and Expo (ICME) | 2023
Funding
National Natural Science Foundation of China
Keywords
Model compression; vision transformer
DOI
10.1109/ICME55011.2023.00251
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision transformers (ViTs) have achieved significant performance on various vision tasks. However, their high computational and memory costs hinder edge deployment. Existing compression methods impose a static accuracy-efficiency constraint during sparsification; this static constraint limits sparsification efficiency, and its initialization relies heavily on human expertise. We propose DynaSlim, a dynamic slimming strategy for ViTs that adapts the accuracy-efficiency constraint during sparsification. We first equip multiple dimensions, including input tokens, Multi-head Self-Attention (MSA), and the Multilayer Perceptron (MLP), with fine-grained, adjustable sparsity weights that act as scaling factors between accuracy and efficiency. We then employ heuristic search for these non-differentiable factors and combine the search with regularization-based sparsification to obtain the optimal sparsified model. Finally, we compress and retrain the sparsified model under various budgets to obtain the resulting submodels. Experiments show that DynaSlim outperforms previous state-of-the-art methods under different budgets; for example, it reduces both the parameters and FLOPs of DeiT-B by 39% while increasing its ImageNet-1K accuracy by 1.9%. Moreover, we demonstrate the transferability of our compressed models on several downstream datasets.
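To make the search-plus-regularization recipe concrete, here is a minimal Python sketch. All names and the toy objective are hypothetical, not taken from the paper: it shows per-dimension sparsity weights scaling an L1 regularizer on learnable gate scores, with the non-differentiable weights tuned by a simple random-mutation heuristic search.

```python
import random

import torch

# Prunable dimensions of a ViT block, as named in the abstract:
# input tokens, Multi-head Self-Attention, and the MLP.
DIMS = ["tokens", "msa_heads", "mlp_neurons"]


def sparsity_loss(gates, weights):
    """L1 sparsity regularizer, scaled per dimension by its sparsity weight."""
    return sum(weights[d] * gates[d].abs().sum() for d in DIMS)


def proxy_score(weights):
    """Hypothetical stand-in for 'accuracy of a short sparsification run
    minus budget violation'; a real pipeline would train and evaluate here."""
    return -sum((w - 0.5) ** 2 for w in weights.values())


def heuristic_search(rounds=50):
    """Random-mutation search over the non-differentiable sparsity weights."""
    best = {d: 0.1 for d in DIMS}
    best_score = proxy_score(best)
    for _ in range(rounds):
        cand = {d: max(1e-4, w * random.uniform(0.5, 2.0))
                for d, w in best.items()}
        score = proxy_score(cand)
        if score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    weights = heuristic_search()                       # step 1: tune factors
    gates = {d: torch.randn(16, requires_grad=True)    # learnable gate scores
             for d in DIMS}
    task_loss = torch.tensor(0.0)                      # stand-in for CE loss
    total = task_loss + sparsity_loss(gates, weights)  # step 2: joint objective
    total.backward()  # gradients push low-importance gates toward zero
    print({d: round(w, 4) for d, w in weights.items()})
```

In the full method, the search objective would be driven by measured accuracy and FLOPs rather than this toy quadratic, and gates with near-zero scores would be pruned before budget-constrained retraining.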
Pages: 1451-1456
Page count: 6