DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited by: 0
Authors
Li, Ke [1]
Wang, Di [1]
Liu, Gang [1]
Zhu, Wenxuan [1]
Zhong, Haodi [1]
Wang, Quan [1]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention is computationally expensive because its cost grows quadratically with the number of tokens, which is especially limiting for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce the cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention in diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% less model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA modules. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
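The fine-near, coarse-far idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's DiagSWin implementation: the diagonal window shape, projections, and multi-head structure are omitted, and the function name `multi_granularity_attention` and its `radius` and `pool` parameters are hypothetical names chosen for illustration. It only shows one plausible way a token could attend to nearby tokens at full resolution and to the rest of the image through average-pooled summaries.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_granularity_attention(feat, radius=1, pool=2):
    """Toy single-head attention on an (H, W, C) feature map.

    Assumption (not from the paper): each query uses a square local
    window of the given radius at fine granularity, plus every
    pool x pool average-pooled block as a coarse token.
    """
    H, W, C = feat.shape
    assert H % pool == 0 and W % pool == 0
    # Coarse tokens: average over non-overlapping pool x pool blocks.
    coarse = feat.reshape(H // pool, pool, W // pool, pool, C).mean(axis=(1, 3))
    coarse = coarse.reshape(-1, C)
    out = np.empty((H, W, C), dtype=float)
    for i in range(H):
        for j in range(W):
            # Fine tokens: the clipped (2*radius+1)^2 neighbourhood.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            fine = feat[i0:i1, j0:j1].reshape(-1, C)
            kv = np.concatenate([fine, coarse], axis=0)
            # Scaled dot-product scores; keys and values share kv here.
            scores = kv @ feat[i, j] / np.sqrt(C)
            out[i, j] = softmax(scores) @ kv
    return out
```

In this sketch each query sees at most `(2*radius+1)**2 + (H//pool)*(W//pool)` keys instead of `H*W`, which is where the cost reduction relative to global attention would come from under these assumptions.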
Pages: 14
Related papers
50 in total
  • [31] Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection
    Fang, Sikai
    Lu, Xiaofeng
    Huang, Yifan
    Sun, Guangling
    Liu, Xuefeng
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (25) : 67213 - 67229
  • [32] Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features
    Guan, Yijia
    Wang, Kundong
    NEUROCOMPUTING, 2025, 630
  • [33] RI-ViT: A Multi-Scale Hybrid Method Based on Vision Transformer for Breast Cancer Detection in Histopathological Images
    Monjezi, Ehsan
    Akbarizadeh, Gholamreza
    Ansari-Asl, Karim
    IEEE ACCESS, 2024, 12 : 186074 - 186086
  • [34] Cascade multi-scale object detection on high-resolution images
    Novoselov, Alexey
    Dyakov, Oleg
    Kostromin, Igor
    Pogibelskiy, Dmitry
    2019 INTERNATIONAL CONFERENCE ON ENGINEERING AND TELECOMMUNICATION (ENT), 2019
  • [35] MULTI-SCALE SAMPLE SELECTION BASED ON STATISTICAL CHARACTERISTICS FOR OBJECT DETECTION
    Li, Zhiguo
    Yuan, Yuan
    Ma, Dandan
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021 : 1485 - 1489
  • [36] Enhancement and Fusion of Multi-Scale Feature Maps for Small Object Detection
    Xue, Zhijun
    Chen, Wenjie
    Li, Jing
    PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020 : 7212 - 7217
  • [37] MDFN: Multi-scale deep feature learning network for object detection
    Ma, Wenchi
    Wu, Yuanwei
    Cen, Feng
    Wang, Guanghui
    PATTERN RECOGNITION, 2020, 100
  • [38] Bridging Multi-Scale Context-Aware Representation for Object Detection
    Wang, Boying
    Ji, Ruyi
    Zhang, Libo
    Wu, Yanjun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (05) : 2317 - 2329
  • [39] OBJECT-ORIENTED CHANGE DETECTION BASED ON MULTI-SCALE APPROACH
    Jia, Yonghong
    Zhou, Mingting
    Ye, Jinshan
    XXIII ISPRS CONGRESS, COMMISSION VII, 2016, 41 (B7) : 517 - 522
  • [40] Specific Windows Search for Multi-Ship and Multi-Scale Wake Detection in SAR Images
    Ding, Kaiyang
    Yang, Junfeng
    Wang, Zhao
    Ni, Kai
    Wang, Xiaohao
    Zhou, Qian
    REMOTE SENSING, 2022, 14 (01)