PSLT: A Light-Weight Vision Transformer With Ladder Self-Attention and Progressive Shift

Cited by: 17
Authors
Wu, Gaojie [1]
Zheng, Wei-Shi [1,2,3,4]
Lu, Yutong [1]
Tian, Qi [5]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510275, Guangdong, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518005, Guangdong, Peoples R China
[3] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp, Minist Educ, Guangzhou 510275, Guangdong, Peoples R China
[4] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou 510006, Guangdong, Peoples R China
[5] Huawei, Cloud & AI BU, Shenzhen 518129, Guangdong, Peoples R China
Keywords
Ladder self-attention; light-weight vision transformer; multimedia information retrieval
DOI
10.1109/TPAMI.2023.3265499
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling a different local self-attention for each branch and letting the branches interact. Second, the input feature of the ladder self-attention block is split equally along the channel dimension across the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of the branches are then combined by a pixel-adaptive fusion. Thus, with a relatively small number of parameters and FLOPs, the ladder self-attention block is capable of modelling long-range interactions. Based on this block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2 M parameters and 1.9 G FLOPs, comparable to several existing models with more than 20 M parameters and 4 G FLOPs. Code is available at https://isee-ai.cn/wugaojie/PSLT.html.
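To make the design described in the abstract concrete, below is a minimal PyTorch sketch of a ladder self-attention block: an equal channel split into branches, windowed (local) self-attention per branch, a progressive shift between branches with branch-to-branch interaction, and pixel-adaptive fusion of the branch outputs. All names here (LadderSelfAttention, _windowed_attn, the fuse convolution) and the specific shift sizes, window size and fusion form are illustrative assumptions, not the authors' released implementation (see the project page above for that).

import torch
import torch.nn as nn


class LadderSelfAttention(nn.Module):
    """Sketch of a ladder self-attention block with progressive shift.

    The input is split equally along the channel dimension into `branches`
    parts; each branch runs windowed (local) self-attention on a feature map
    shifted slightly further than the previous branch's, receives the previous
    branch's output (the "ladder" interaction), and a per-pixel softmax fuses
    all branch outputs back to the input shape.
    """

    def __init__(self, channels, branches=3, heads=2, window=7):
        super().__init__()
        assert channels % branches == 0
        self.branches, self.window = branches, window
        dim = channels // branches  # each branch works on ~1/branches channels
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(branches)
        )
        # Pixel-adaptive fusion: a 1x1 conv predicts per-pixel branch weights.
        self.fuse = nn.Conv2d(channels, branches, kernel_size=1)

    def _windowed_attn(self, x, attn):
        # Self-attention inside non-overlapping window x window patches only,
        # which is the local attention that keeps the cost low.
        B, C, H, W = x.shape
        w = self.window  # H and W are assumed divisible by w in this sketch
        t = x.view(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        t, _ = attn(t, t, t)
        t = t.view(B, H // w, W // w, w, w, C)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

    def forward(self, x):
        chunks = x.chunk(self.branches, dim=1)  # equal channel split
        outs, prev = [], 0
        for i, (chunk, attn) in enumerate(zip(chunks, self.attn)):
            # Progressive shift: branch i sees a view rolled i pixels further,
            # so stacked branches cover a progressively larger receptive field.
            shifted = torch.roll(chunk, shifts=(i, i), dims=(2, 3))
            y = self._windowed_attn(shifted + prev, attn)
            outs.append(y)
            prev = y  # ladder interaction: pass this branch's output onward
        stacked = torch.stack(outs, dim=1)            # (B, branches, C/b, H, W)
        weights = self.fuse(torch.cat(outs, dim=1))   # (B, branches, H, W)
        weights = weights.softmax(dim=1).unsqueeze(2)
        return (stacked * weights).reshape(x.shape)   # pixel-adaptive fusion

For instance, LadderSelfAttention(96)(torch.randn(2, 96, 56, 56)) returns a tensor of the same shape; stacking such blocks with downsampling stages would give a backbone in the spirit of PSLT. Note that each attention layer here operates on only channels/branches dimensions, which is where the roughly 1/3 reduction in parameters and FLOPs mentioned in the abstract comes from.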
Pages: 11120-11135
Page count: 16