Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token;
DOI
10.1016/j.neucom.2025.129696
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The original self-attention mechanism suffers from quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. This approach introduces dynamic tokens, which reduce the computational complexity to linear while capturing multi-scale features. The paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages.
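The abstract describes attention computed over channel tokens rather than spatial tokens, which is what makes the cost linear in the number of spatial positions. Below is a minimal, hypothetical PyTorch sketch of plain channel self-attention: the attention matrix is (C x C) instead of (N x N), so doubling the spatial resolution doubles, rather than quadruples, the attention cost. The class name ChannelSelfAttention, the head count, and the scaling factor are illustrative assumptions; the paper's actual DCSA module additionally uses dynamic tokens and hierarchical window settings not shown here.

    # Hypothetical sketch: self-attention over channel tokens (C x C attention),
    # linear in the number of spatial positions N. Not the paper's DCSA module.
    import torch
    import torch.nn as nn

    class ChannelSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, N, C) with N spatial positions and C channels.
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
            # q, k, v: each (B, heads, C/heads, N) -- channels act as tokens.
            q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)
            # (C/heads x C/heads) attention per head; cost is linear in N.
            attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)
            attn = attn.softmax(dim=-1)
            out = attn @ v                                  # (B, heads, C/heads, N)
            out = out.permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        layer = ChannelSelfAttention(dim=64, num_heads=8)
        tokens = torch.randn(2, 196, 64)   # batch of 2, 14x14 patches, 64 channels
        print(layer(tokens).shape)          # torch.Size([2, 196, 64])

In this sketch the per-layer attention cost is O(N * C^2 / heads) rather than O(N^2 * C), which is the sense in which channel tokenization yields linear complexity in the spatial dimension.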
Pages: 9
Related Papers
50 records in total
  • [31] Accurate Facial Landmark Detector via Multi-scale Transformer
    Sha, Yuyang
    Meng, Weiyu
    Zhai, Xiaobing
    Xie, Can
    Li, Kefeng
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT V, 2024, 14429 : 278 - 290
  • [32] Fine-Grained Modulation Classification Using Multi-Scale Radio Transformer With Dual-Channel Representation
    Zheng, Qinghe
    Zhao, Penghui
    Wang, Hongjun
    Elhanashi, Abdussalam
    Saponara, Sergio
    IEEE COMMUNICATIONS LETTERS, 2022, 26 (06) : 1298 - 1302
  • [33] TCAMS-Trans: Efficient temporal-channel attention multi-scale transformer for net load forecasting
    Zhang, Qingyong
    Zhou, Shiyang
    Xu, Bingrong
    Li, Xinran
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 118
  • [34] A Multi-Scale Channel Attention Network for Prostate Segmentation
    Ding, Meiwen
    Lin, Zhiping
    Lee, Chau Hung
    Tan, Cher Heng
    Huang, Weimin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2023, 70 (05) : 1754 - 1758
  • [35] One Model to Synthesize Them All: Multi-Contrast Multi-Scale Transformer for Missing Data Imputation
    Liu, Jiang
    Pasumarthi, Srivathsa
    Duffy, Ben
    Gong, Enhao
    Datta, Keshav
    Zaharchuk, Greg
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (09) : 2577 - 2591
  • [36] MAFormer: A transformer network with multi-scale attention fusion for visual recognition
    Sun, Huixin
    Wang, Yunhao
    Wang, Xiaodi
    Zhang, Bin
    Xin, Ying
    Zhang, Baochang
    Cao, Xianbin
    Ding, Errui
    Han, Shumin
    NEUROCOMPUTING, 2024, 595
  • [37] Automatic center identification of electron diffraction with multi-scale transformer networks
    Ge, Mengshu
    Pan, Yue
    Liu, Xiaozhi
    Zhao, Zhicheng
    Su, Dong
    ULTRAMICROSCOPY, 2024, 259
  • [38] Study of EEG classification of depression by multi-scale convolution combined with the Transformer
    Zhai F.-W.
    Sun F.
    Jin J.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2024, 51 (02) : 182 - 195
  • [39] Transformer-based Multi-scale Underwater Image Enhancement Network
    Yang, Ai-Ping
    Fang, Si-Jie
    Shao, Ming-Fu
    Zhang, Teng-Fei
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2024, 45 (12) : 1696 - 1705
  • [40] Hierarchical Transformer with Multi-Scale Parallel Aggregation for Breast Tumor Segmentation
    Xia, Ping
    Wang, Yudie
    Lei, Bangjun
    Peng, Cheng
    Zhang, Guangyi
    Tang, Tinglong
    LASER & OPTOELECTRONICS PROGRESS, 2025, 62 (02)