Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token
DOI
10.1016/j.neucom.2025.129696
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Original self-attention suffers from quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. This approach introduces dynamic tokens, which reduce computational complexity to linear while capturing multi-scale features. The paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. The hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Feature visualizations of DCSA show that dynamic channel tokens recognize objects at very early stages.
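To make the linear-complexity claim concrete, the following is a minimal sketch (not the authors' released code) of channel-token self-attention in PyTorch: attention is computed between channels rather than spatial positions, so the attention map is C x C and the cost grows linearly with the number of spatial tokens N. The class name, head count, and normalization choice are illustrative assumptions; the paper's DCSA additionally uses dynamic tokens and hierarchical window settings that are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Illustrative channel-token attention: the attention matrix is
    (C/heads) x (C/heads) per head, independent of the spatial size N."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N spatial positions and C channels
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)            # each: (B, heads, C/heads, N)
        q = F.normalize(q, dim=-1)                       # normalize along the spatial axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1) # (B, heads, C/heads, C/heads)
        out = attn @ v                                   # (B, heads, C/heads, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)   # back to (B, N, C)
        return self.proj(out)

# Usage: doubling the number of spatial tokens only doubles the cost,
# because the attention map depends only on the channel dimension.
x = torch.randn(2, 56 * 56, 96)
y = ChannelSelfAttention(dim=96, num_heads=8)(x)
print(y.shape)  # torch.Size([2, 3136, 96])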
Pages: 9
Related Papers
50 records in total
  • [21] Jiao, Jiayu; Tang, Yu-Ming; Lin, Kun-Yu; Gao, Yipeng; Ma, Andy J.; Wang, Yaowei; Zheng, Wei-Shi. DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25: 8906-8919
  • [22] Wu, Xianhao; Chen, Hongming; Chen, Xiang; Xu, Guili. Multi-scale transformer with conditioned prompt for image deraining. DIGITAL SIGNAL PROCESSING, 2025, 156
  • [23] Yan, Fangyuan; Yan, Bin; Liang, Wei; Pei, Mingtao. Token labeling-guided multi-scale medical image classification. PATTERN RECOGNITION LETTERS, 2024, 178: 28-34
  • [24] Zhang, Shuai; Liu, Yutao. Multi-scale Transformer with Decoder for Image Quality Assessment. ARTIFICIAL INTELLIGENCE, CICAI 2023, PT I, 2024, 14473: 220-231
  • [25] Balachandran, G.; Ranjith, S.; Chenthil, T. R.; Jagan, G. C. Facial expression-based emotion recognition across diverse age groups: a multi-scale vision transformer with contrastive learning approach. JOURNAL OF COMBINATORIAL OPTIMIZATION, 2025, 49 (01)
  • [26] Hussein, Ramy; Lee, Soojin; Ward, Rabab. Multi-Channel Vision Transformer for Epileptic Seizure Prediction. BIOMEDICINES, 2022, 10 (07)
  • [27] Xi, Chiming; Wang, Hui; Wang, Xubin. A novel multi-scale network intrusion detection model with transformer. SCIENTIFIC REPORTS, 2024, 14 (01)
  • [28] Lu L.; Zhong W.; Wu X. Image Tampering Localization Based on Visual Multi-Scale Transformer. Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science), 2022, 50 (06): 10-18
  • [29] Ren, Fei; Chang, Qingling; Liu, Xinglin; Cui, Yan. Multi-scale Transformer 3D Plane Recovery. FOURTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING, ICGIP 2022, 2022, 12705
  • [30] Wang, Zenan; Liu, Zhen; Yu, Jianfeng; Gao, Yingxin; Liu, Ming. Multi-scale nested UNet with transformer for colorectal polyp segmentation. JOURNAL OF APPLIED CLINICAL MEDICAL PHYSICS, 2024, 25 (06)