Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token
DOI
10.1016/j.neucom.2025.129696
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The original self-attention has quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. The new approach introduces dynamic tokens, which reduce computational complexity to linear while capturing multi-scale features. This paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages.
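The linear-complexity claim follows from attending over channels rather than spatial positions: a channel-token attention map is C x C, so its cost O(N * C^2) grows linearly with the number of spatial positions N = H * W, instead of the O(N^2 * C) of spatial self-attention. The PyTorch sketch below is a minimal illustration of this generic channel-token attention idea (in the style of cross-covariance attention), not the authors' exact DCSA; the module name, head count, and learnable temperature are assumptions, and the paper's hierarchical window settings are not reproduced.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Self-attention over channel tokens instead of spatial tokens.

    The attention map is (C/h x C/h) per head, so the cost is
    O(N * C^2): linear in the number of spatial positions N = H * W.
    Hypothetical sketch, not the paper's DCSA module.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable per-head temperature, as used in cross-covariance attention.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape  # N = H * W spatial positions
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        # Each of q, k, v: (B, heads, C/h, N) -- channels act as tokens.
        q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)
        # L2-normalize along the spatial axis so the channel-channel
        # attention map is a cosine-similarity (cross-covariance) matrix.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, C/h, C/h)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: a 56x56 feature map with 96 channels; doubling H*W doubles the cost.
x = torch.randn(2, 56 * 56, 96)
y = ChannelSelfAttention(96)(x)
print(y.shape)  # torch.Size([2, 3136, 96])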
Pages: 9
Related Papers
50 records in total
  • [1] Data-efficient multi-scale fusion vision transformer
    Tang, Hao
    Liu, Dawei
    Shen, Chengchao
    PATTERN RECOGNITION, 2025, 161
  • [2] DeepFake detection with multi-scale convolution and vision transformer
    Lin, Hao
    Huang, Wenmin
    Luo, Weiqi
    Lu, Wei
    DIGITAL SIGNAL PROCESSING, 2023, 134
  • [3] MSAPVT: a multi-scale attention pyramid vision transformer network for large-scale fruit recognition
    Rao, Yao
    Li, Chaofeng
    Xu, Feiran
    Guo, Ya
    JOURNAL OF FOOD MEASUREMENT AND CHARACTERIZATION, 2024, 18 (11) : 9233 - 9251
  • [4] A Robust Image Semantic Communication System With Multi-Scale Vision Transformer
    Peng, Xiang
    Qin, Zhijin
    Tao, Xiaoming
    Lu, Jianhua
    Letaief, Khaled B.
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2025, 43 (04) : 1278 - 1291
  • [5] MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition
    Huo, Hua
    Li, Bingjie
    ELECTRONICS, 2024, 13 (05)
  • [6] Automatic pruning rate adjustment for dynamic token reduction in vision transformer
    Ishibashi, Ryuto
    Meng, Lin
    APPLIED INTELLIGENCE, 2025, 55 (05)
  • [7] Matching Multi-Scale Feature Sets in Vision Transformer for Few-Shot Classification
    Song, Mingchen
    Yao, Fengqin
    Zhong, Guoqiang
    Ji, Zhong
    Zhang, Xiaowei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12638 - 12651
  • [8] Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation
    Rahman, Md Mostafijur
    Marculescu, Radu
    MEDICAL IMAGING WITH DEEP LEARNING, VOL 227, 2023, 227 : 1526 - 1544
  • [9] A Novel Multi-Scale Transformer for Object Detection in Aerial Scenes
    Lu, Guanlin
    He, Xiaohui
    Wang, Qiang
    Shao, Faming
    Wang, Hongwei
    Wang, Jinkang
    DRONES, 2022, 6 (08)
  • [10] DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation
    Li, Ke
    Wang, Di
    Liu, Gang
    Zhu, Wenxuan
    Zhong, Haodi
    Wang, Quan
    NEURAL NETWORKS, 2024, 180