DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Times Cited: 0
Authors
Li, Ke [1 ]
Wang, Di [1 ]
Liu, Gang [1 ]
Zhu, Wenxuan [1 ]
Zhong, Haodi [1 ]
Wang, Quan [1 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation;
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention suffers from high computational cost due to its quadratic complexity, especially for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce this cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, they usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales within each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new family of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet, outperforming the SOTA CSWin Transformer with 40% less model size and computational cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models.
In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
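The core idea stated in the abstract — each token attends to nearby tokens at fine granularity and to distant tokens at coarse (pooled) granularity — can be illustrated with a toy 1-D sketch. This is a minimal illustration only, assuming simple average pooling of far-away tokens; the function name, window/pool sizes, and pooling scheme are illustrative assumptions and do not reproduce the paper's actual diagonal-window implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_scale_attention(x, window=4, pool=2):
    """Toy 1-D hybrid-scale self-attention.

    For each query token, the key set is built at two granularities:
    the `window` nearest tokens at full resolution (fine), and all
    remaining tokens average-pooled by a factor of `pool` (coarse).
    This shrinks the attention matrix while keeping a global view.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        fine = x[lo:hi]                                  # fine-grained local keys
        far = np.concatenate([x[:lo], x[hi:]], axis=0)   # everything outside the window
        if len(far):
            m = (len(far) // pool) * pool
            coarse = far[:m].reshape(-1, pool, d).mean(axis=1)
            if m < len(far):                             # pool the ragged tail too
                coarse = np.vstack([coarse, far[m:].mean(axis=0, keepdims=True)])
            keys = np.vstack([fine, coarse])
        else:
            keys = fine
        attn = softmax(x[i] @ keys.T / np.sqrt(d))       # scaled dot-product weights
        out[i] = attn @ keys                             # values == keys in this toy
    return out
```

With `n` tokens, the per-query key count drops from `n` to roughly `window + (n - window) / pool`, which is how coarse-granularity attention to distant tokens reduces cost without discarding long-range context.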
Pages: 14