Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token
DOI
10.1016/j.neucom.2025.129696
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The original self-attention has quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. The new approach introduces dynamic tokens, which reduce computational complexity to linear while capturing multi-scale features. This paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages.
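The linear-complexity claim follows from attending over channels rather than spatial positions: a channel-token attention map is C x C, so its cost O(N * C^2) grows linearly with the number of spatial positions N = H * W, instead of the O(N^2 * C) of spatial self-attention. The PyTorch sketch below is a minimal illustration of this generic channel-token attention idea (in the style of cross-covariance attention), not the authors' exact DCSA; the module name, head count, and learnable temperature are assumptions, and the paper's hierarchical window settings are not reproduced.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Self-attention over channel tokens instead of spatial tokens.

    The attention map is (C/h x C/h) per head, so the cost is
    O(N * C^2): linear in the number of spatial positions N = H * W.
    Hypothetical sketch, not the paper's DCSA module.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable per-head temperature, as used in cross-covariance attention.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape  # N = H * W spatial positions
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        # Each of q, k, v: (B, heads, C/h, N) -- channels act as tokens.
        q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)
        # L2-normalize along the spatial axis so the channel-channel
        # attention map is a cosine-similarity (cross-covariance) matrix.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, C/h, C/h)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: a 56x56 feature map with 96 channels; doubling H*W doubles the cost.
x = torch.randn(2, 56 * 56, 96)
y = ChannelSelfAttention(96)(x)
print(y.shape)  # torch.Size([2, 3136, 96])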
Pages: 9
Related Papers
50 records in total
  • [1] Data-efficient multi-scale fusion vision transformer
    Tang, Hao
    Liu, Dawei
    Shen, Chengchao
    PATTERN RECOGNITION, 2025, 161
  • [2] DeepFake detection with multi-scale convolution and vision transformer
    Lin, Hao
    Huang, Wenmin
    Luo, Weiqi
    Lu, Wei
    DIGITAL SIGNAL PROCESSING, 2023, 134
  • [3] MSAPVT: a multi-scale attention pyramid vision transformer network for large-scale fruit recognition
    Rao, Yao
    Li, Chaofeng
    Xu, Feiran
    Guo, Ya
    JOURNAL OF FOOD MEASUREMENT AND CHARACTERIZATION, 2024, 18 (11) : 9233 - 9251
  • [4] A Robust Image Semantic Communication System With Multi-Scale Vision Transformer
    Peng, Xiang
    Qin, Zhijin
    Tao, Xiaoming
    Lu, Jianhua
    Letaief, Khaled B.
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2025, 43 (04) : 1278 - 1291
  • [5] MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition
    Huo, Hua
    Li, Bingjie
    ELECTRONICS, 2024, 13 (05)
  • [6] Automatic pruning rate adjustment for dynamic token reduction in vision transformer
    Ishibashi, Ryuto
    Meng, Lin
    APPLIED INTELLIGENCE, 2025, 55 (05)
  • [7] Matching Multi-Scale Feature Sets in Vision Transformer for Few-Shot Classification
    Song, Mingchen
    Yao, Fengqin
    Zhong, Guoqiang
    Ji, Zhong
    Zhang, Xiaowei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12638 - 12651
  • [8] Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation
    Rahman, Md Mostafijur
    Marculescu, Radu
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 1526 - 1544
  • [9] A Novel Multi-Scale Transformer for Object Detection in Aerial Scenes
    Lu, Guanlin
    He, Xiaohui
    Wang, Qiang
    Shao, Faming
    Wang, Hongwei
    Wang, Jinkang
    DRONES, 2022, 6 (08)
  • [10] DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation
    Li, Ke
    Wang, Di
    Liu, Gang
    Zhu, Wenxuan
    Zhong, Haodi
    Wang, Quan
    NEURAL NETWORKS, 2024, 180