DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Times Cited: 0
Authors
Li, Ke [1 ]
Wang, Di [1 ]
Liu, Gang [1 ]
Zhu, Wenxuan [1 ]
Zhong, Haodi [1 ]
Wang, Quan [1 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation;
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention suffers from high computational cost due to its quadratic complexity, especially for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce this cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, they usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new family of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet, outperforming the SOTA CSWin Transformer with 40% less model size and computational cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models.
In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
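The core idea stated in the abstract — each token attends to nearby tokens at fine granularity and to distant tokens at coarse (pooled) granularity — can be illustrated with a toy 1-D sketch. This is a minimal illustration only, assuming simple average pooling of far-away tokens; the function name, window/pool sizes, and pooling scheme are illustrative assumptions and do not reproduce the paper's actual diagonal-window implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_scale_attention(x, window=4, pool=2):
    """Toy 1-D hybrid-scale self-attention.

    For each query token, the key set is built at two granularities:
    the `window` nearest tokens at full resolution (fine), and all
    remaining tokens average-pooled by a factor of `pool` (coarse).
    This shrinks the attention matrix while keeping a global view.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        fine = x[lo:hi]                                  # fine-grained local keys
        far = np.concatenate([x[:lo], x[hi:]], axis=0)   # everything outside the window
        if len(far):
            m = (len(far) // pool) * pool
            coarse = far[:m].reshape(-1, pool, d).mean(axis=1)
            if m < len(far):                             # pool the ragged tail too
                coarse = np.vstack([coarse, far[m:].mean(axis=0, keepdims=True)])
            keys = np.vstack([fine, coarse])
        else:
            keys = fine
        attn = softmax(x[i] @ keys.T / np.sqrt(d))       # scaled dot-product weights
        out[i] = attn @ keys                             # values == keys in this toy
    return out
```

With `n` tokens, the per-query key count drops from `n` to roughly `window + (n - window) / pool`, which is how coarse-granularity attention to distant tokens reduces cost without discarding long-range context.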
Pages: 14