SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

Cited by: 61
Authors
Kong, Zhenglun [1 ]
Dong, Peiyan [1 ]
Ma, Xiaolong [2 ]
Meng, Xin [3 ]
Niu, Wei [4 ]
Sun, Mengshu [1 ]
Shen, Xuan [1 ]
Yuan, Geng [1 ]
Ren, Bin [4 ]
Tang, Hao [5 ]
Qin, Minghai [1 ]
Wang, Yanzhi [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Clemson Univ, Clemson, SC 29634 USA
[3] Peking Univ, Beijing 100871, Peoples R China
[4] Coll William & Mary, Williamsburg, VA 23185 USA
[5] Swiss Fed Inst Technol, CVL, CH-8092 Zurich, Switzerland
Source
COMPUTER VISION, ECCV 2022, PT XI | 2022 / Vol. 13671
Funding
National Science Foundation (USA);
Keywords
Vision transformer; Model compression; Hardware acceleration; Mobile devices; FPGA;
DOI
10.1007/978-3-031-20083-0_37
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recently, the Vision Transformer (ViT) has continuously set new milestones in computer vision, but its high computation and memory costs hinder deployment in industrial production. Considering the computational complexity, the internal data patterns of ViTs, and edge-device deployment, we propose a latency-aware soft token pruning framework, SPViT, which can be applied to vanilla Transformers with both flat and hierarchical structures, such as DeiT and Swin Transformer (Swin). More concretely, we design a dynamic attention-based multi-head token selector, a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique that integrates the less informative tokens chosen by the selector module into a package token rather than discarding them completely. Through our latency-aware training strategy, SPViT is tailored to the accuracy-latency trade-off required by specific edge devices. Experimental results show that SPViT significantly reduces the computation cost of ViTs with comparable performance on image classification. Moreover, SPViT guarantees that the identified model meets the latency specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile devices. For example, SPViT reduces the latency of DeiT-T to 26 ms (26%-41% better than existing works) on a mobile device with 0.25%-4% higher top-1 accuracy on ImageNet. Our code is released at https://github.com/PeiyanFlying/SPViT.
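To make the soft pruning idea described in the abstract concrete, below is a minimal PyTorch sketch of attention-score-based token selection where the less informative tokens are fused into a single "package" token instead of being discarded. The module name, selector architecture, and keep_ratio parameter are illustrative assumptions, not the authors' implementation; see the SPViT repository linked above for the actual code.

```python
import torch
import torch.nn as nn


class SoftTokenPruning(nn.Module):
    """Illustrative sketch: select informative tokens with a lightweight
    score predictor and fuse the rest into one package token (soft pruning).
    Hypothetical simplification of the SPViT idea, not the released code."""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight selector predicting a keep-score per token.
        self.selector = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + N, dim) with a leading class token that is never pruned.
        cls_tok, patches = x[:, :1], x[:, 1:]
        b, n, d = patches.shape

        weights = self.selector(patches).squeeze(-1).softmax(dim=-1)  # (B, N)
        n_keep = max(1, int(n * self.keep_ratio))
        keep_idx = weights.topk(n_keep, dim=-1).indices               # (B, n_keep)

        # Gather the informative tokens.
        kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

        # Soft pruning: fuse the remaining tokens into one package token,
        # weighted by their renormalized selector scores.
        keep_mask = torch.zeros_like(weights, dtype=torch.bool).scatter_(1, keep_idx, True)
        drop_w = weights.masked_fill(keep_mask, 0.0)
        drop_w = drop_w / drop_w.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        package = (drop_w.unsqueeze(-1) * patches).sum(dim=1, keepdim=True)  # (B, 1, dim)

        return torch.cat([cls_tok, kept, package], dim=1)
```

In a full pipeline, a module like this would sit after selected encoder blocks, and the keep ratio (or an equivalent per-layer budget) would be chosen against a measured latency target during fine-tuning, which is roughly the role the paper's latency-aware training strategy plays.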
Pages: 620-640
Number of pages: 21