Neighborhood Attention Transformer

Cited by: 114
Authors
Hassani, Ali [1,2]
Walton, Steven [1,2]
Li, Jiachen [1,2]
Li, Shen [4]
Shi, Humphrey [1,2,3]
Affiliations
[1] University of Oregon, SHI Labs, Eugene, OR 97403, USA
[2] University of Illinois Urbana-Champaign, Champaign, IL 61801, USA
[3] Picsart AI Research (PAIR), New York, NY, USA
[4] Meta / Facebook AI, Menlo Park, CA, USA
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.00599
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation that localizes self-attention (SA) to the nearest neighboring pixels, and therefore enjoys linear time and space complexity, compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K, improvements of 1.9% in ImageNet accuracy, 1.0% in COCO mAP, and 2.6% in ADE20K mIoU over a Swin model of similar size. To support more research on sliding-window attention, we open-source our project and release our checkpoints.
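For readers unfamiliar with the mechanism, the following is a minimal, loop-based Python/PyTorch sketch of single-head neighborhood attention. It is illustrative only: the function name, tensor layout, and scaling here are our assumptions for exposition, not the NATTEN API, whose fused C++/CUDA kernels avoid materializing per-pixel windows.

import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, kernel_size=7):
    """Naive single-head NA over (H, W, d) feature maps. Illustrative only."""
    H, W, d = q.shape
    r = kernel_size // 2
    out = torch.empty_like(q)
    for i in range(H):
        # Clamp the window inside the image so each query keeps exactly
        # kernel_size rows; near the border the neighborhood shifts inward.
        i0 = min(max(i - r, 0), max(H - kernel_size, 0))
        for j in range(W):
            j0 = min(max(j - r, 0), max(W - kernel_size, 0))
            keys = k[i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(-1, d)
            vals = v[i0:i0 + kernel_size, j0:j0 + kernel_size].reshape(-1, d)
            # Each query attends to kernel_size**2 keys: O(H*W*k^2*d) total,
            # linear in the number of pixels, vs. O((H*W)^2*d) for full SA.
            attn = F.softmax(q[i, j] @ keys.T / d ** 0.5, dim=-1)
            out[i, j] = attn @ vals
    return out

if __name__ == "__main__":
    q = k = v = torch.randn(14, 14, 32)
    print(neighborhood_attention(q, k, v).shape)  # torch.Size([14, 14, 32])

In practice NA is applied per head inside the hierarchical NAT backbone, and NATTEN replaces these Python loops with fused kernels, which is where the reported speed and memory gains over WSA come from.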
Pages: 6185 - 6194
Page count: 10