Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token
DOI
10.1016/j.neucom.2025.129696
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Original self-attention suffers from quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. This approach introduces dynamic tokens, which reduce computational complexity to linear while capturing multi-scale features. The paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. The hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Feature visualizations of DCSA show that dynamic channel tokens recognize objects at very early stages.
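To make the linear-complexity claim concrete, the following is a minimal sketch (not the authors' released code) of channel-token self-attention in PyTorch: attention is computed between channels rather than spatial positions, so the attention map is C x C and the cost grows linearly with the number of spatial tokens N. The class name, head count, and normalization choice are illustrative assumptions; the paper's DCSA additionally uses dynamic tokens and hierarchical window settings that are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Illustrative channel-token attention: the attention matrix is
    (C/heads) x (C/heads) per head, independent of the spatial size N."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N spatial positions and C channels
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)            # each: (B, heads, C/heads, N)
        q = F.normalize(q, dim=-1)                       # normalize along the spatial axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1) # (B, heads, C/heads, C/heads)
        out = attn @ v                                   # (B, heads, C/heads, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)   # back to (B, N, C)
        return self.proj(out)

# Usage: doubling the number of spatial tokens only doubles the cost,
# because the attention map depends only on the channel dimension.
x = torch.randn(2, 56 * 56, 96)
y = ChannelSelfAttention(dim=96, num_heads=8)(x)
print(y.shape)  # torch.Size([2, 3136, 96])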
Pages: 9
Related Papers
50 records in total
  • [21] Jiao, Jiayu; Tang, Yu-Ming; Lin, Kun-Yu; Gao, Yipeng; Ma, Andy J.; Wang, Yaowei; Zheng, Wei-Shi. DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 8906-8919
  • [22] Wu, Xianhao; Chen, Hongming; Chen, Xiang; Xu, Guili. Multi-scale transformer with conditioned prompt for image deraining. DIGITAL SIGNAL PROCESSING, 2025, 156
  • [23] Yan, Fangyuan; Yan, Bin; Liang, Wei; Pei, Mingtao. Token labeling-guided multi-scale medical image classification. PATTERN RECOGNITION LETTERS, 2024, 178: 28-34
  • [24] Zhang, Shuai; Liu, Yutao. Multi-scale Transformer with Decoder for Image Quality Assessment. ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I, 2024, 14473: 220-231
  • [25] Balachandran, G.; Ranjith, S.; Chenthil, T. R.; Jagan, G. C. Facial expression-based emotion recognition across diverse age groups: a multi-scale vision transformer with contrastive learning approach. JOURNAL OF COMBINATORIAL OPTIMIZATION, 2025, 49 (01)
  • [26] Hussein, Ramy; Lee, Soojin; Ward, Rabab. Multi-Channel Vision Transformer for Epileptic Seizure Prediction. BIOMEDICINES, 2022, 10 (07)
  • [27] Xi, Chiming; Wang, Hui; Wang, Xubin. A novel multi-scale network intrusion detection model with transformer. SCIENTIFIC REPORTS, 2024, 14 (01)
  • [28] Lu L.; Zhong W.; Wu X. Image Tampering Localization Based on Visual Multi-Scale Transformer. Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2022, 50 (06): 10-18
  • [29] Ren, Fei; Chang, Qingling; Liu, Xinglin; Cui, Yan. Multi-scale Transformer 3D Plane Recovery. FOURTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING, ICGIP 2022, 2022, 12705
  • [30] Wang, Zenan; Liu, Zhen; Yu, Jianfeng; Gao, Yingxin; Liu, Ming. Multi-scale nested UNet with transformer for colorectal polyp segmentation. JOURNAL OF APPLIED CLINICAL MEDICAL PHYSICS, 2024, 25 (06)