DaViT: Dual Attention Vision Transformers

Cited by: 159
Authors
Ding, Mingyu [1]
Xiao, Bin [2]
Codella, Noel [2]
Luo, Ping [1]
Wang, Jingdong [3]
Yuan, Lu [2]
Affiliations
[1] Univ Hong Kong, Pok Fu Lam, Hong Kong, Peoples R China
[2] Microsoft, Bellevue, WA 98004 USA
[3] Baidu, Beijing, Peoples R China
Source
COMPUTER VISION, ECCV 2022, PT XXIV | 2022, Vol. 13684
DOI
10.1007/978-3-031-20053-3_5
CLC number (Chinese Library Classification)
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both "spatial tokens" and "channel tokens". With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show that DaViT backbones achieve state-of-the-art performance on four different tasks. Specifically, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K without extra training data, using 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/microsoft/DaViT.
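
To make the channel-token view concrete, below is a minimal PyTorch sketch of a channel group attention block in the spirit of the abstract: channels act as tokens, the spatial positions act as each token's features, and channels are split into groups so the cost stays linear in the number of pixels. The class name, group count, and scaling factor are illustrative assumptions, not the authors' implementation (see the linked repository for that); the complementary spatial attention follows the usual windowed self-attention pattern.

# Illustrative sketch only; shapes, names, and the attention scale are assumptions.
import torch
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    def __init__(self, dim, groups=8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C) with N = H*W spatial positions
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # View each tensor so that channels (split into groups) become the tokens
        # and the N spatial positions become each token's feature vector.
        def channel_tokens(t):
            return t.transpose(1, 2).reshape(B, self.groups, C // self.groups, N)

        q, k, v = map(channel_tokens, (q, k, v))
        # Attention scores are between channels within a group: (B, groups, C/g, C/g),
        # so the cost grows linearly with the number of spatial positions N.
        attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)  # scale factor is illustrative
        out = attn.softmax(dim=-1) @ v                  # (B, groups, C/g, N)
        out = out.reshape(B, C, N).transpose(1, 2)      # back to (B, N, C)
        return self.proj(out)

# Example: a 14x14 feature map with 96 channels keeps its shape.
x = torch.randn(2, 14 * 14, 96)
y = ChannelGroupAttention(dim=96)(x)  # -> torch.Size([2, 196, 96])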
Pages: 74-92
Number of pages: 19