SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

Cited by: 21
Authors
Chen, Xuanyao [1 ,2 ]
Liu, Zhijian [4 ]
Tang, Haotian [4 ]
Yi, Li [1 ,3 ]
Zhao, Hang [1 ,3 ]
Han, Song [4 ]
Affiliations
[1] Shanghai Qi Zhi Inst, Shanghai, Peoples R China
[2] Fudan Univ, Shanghai, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] MIT, Cambridge, MA, USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
U.S. National Science Foundation;
DOI
10.1109/CVPR52729.2023.00205
CLC classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective way to reduce computation. This, however, is hard to translate into actual speedup for CNNs, since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
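The core idea in the abstract, pruning entire attention windows rather than individual pixels so that the surviving windows stay a regular, batchable workload, can be sketched in a few lines. The snippet below is a minimal numpy illustration, not the authors' implementation: the function name `prune_windows` and the L2-norm importance score are assumptions for illustration; the paper's actual scoring and layerwise sparsity ratios come from its sparsity-aware adaptation and evolutionary search.

```python
import numpy as np

def prune_windows(windows, sparsity):
    """Keep only the highest-scoring windows at a given sparsity ratio.

    windows:  array of shape (num_windows, tokens_per_window, channels)
    sparsity: fraction of windows to drop, e.g. 0.6 drops 60% of them
    Returns (kept_windows, kept_indices); the indices let downstream code
    scatter attention outputs back to their original window positions.
    """
    num_windows = windows.shape[0]
    num_keep = max(1, int(round(num_windows * (1.0 - sparsity))))
    # Illustrative importance score: L2 norm of each window's activations.
    scores = np.linalg.norm(windows.reshape(num_windows, -1), axis=1)
    # Top-scoring windows, in descending score order.
    kept = np.argsort(-scores)[:num_keep]
    return windows[kept], kept
```

Because the kept windows form a dense batch of identical shape, window attention runs on them without irregular indexing, which is why this form of sparsity yields real latency reduction where per-pixel sparsity in CNNs does not.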
Pages: 2061-2070
Page count: 10