Dynamic Channel Token Vision Transformer with linear computation complexity and multi-scale features

Cited: 0
Authors
Guan, Yijia [1 ]
Wang, Kundong [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai, Peoples R China
Keywords
Deep learning; Vision Transformer; Channel token;
DOI
10.1016/j.neucom.2025.129696
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The original self-attention mechanism suffers from quadratic computational complexity. In this paper, we propose a novel tokenization paradigm that decouples the token scope from the spatial dimension. This approach introduces dynamic tokens, which reduce the computational complexity to linear while capturing multi-scale features. The paradigm is implemented in the proposed Dynamic Channel Token Vision Transformer (DCT-ViT), which combines Window Self-Attention (WSA) and Dynamic Channel Self-Attention (DCSA) to capture both fine-grained and coarse-grained features. Our hierarchical window settings in DCSA prioritize small tokens. DCT-ViT-S/B achieves 82.9%/84.3% Top-1 accuracy on ImageNet-1k (Deng et al., 2009), and 47.9/49.8 box mAP and 43.4/44.6 mask mAP on COCO 2017 (Lin et al., 2014) with Mask R-CNN (He et al., 2017) under the 3x schedule. Visualization of the features in DCSA shows that dynamic channel tokens recognize objects at very early stages.
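The abstract describes attention computed over channel tokens rather than spatial tokens, which is what makes the cost linear in the number of spatial positions. Below is a minimal, hypothetical PyTorch sketch of plain channel self-attention: the attention matrix is (C x C) instead of (N x N), so doubling the spatial resolution doubles, rather than quadruples, the attention cost. The class name ChannelSelfAttention, the head count, and the scaling factor are illustrative assumptions; the paper's actual DCSA module additionally uses dynamic tokens and hierarchical window settings not shown here.

    # Hypothetical sketch: self-attention over channel tokens (C x C attention),
    # linear in the number of spatial positions N. Not the paper's DCSA module.
    import torch
    import torch.nn as nn

    class ChannelSelfAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, N, C) with N spatial positions and C channels.
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
            # q, k, v: each (B, heads, C/heads, N) -- channels act as tokens.
            q, k, v = qkv.permute(2, 0, 3, 4, 1).unbind(0)
            # (C/heads x C/heads) attention per head; cost is linear in N.
            attn = (q @ k.transpose(-2, -1)) * (N ** -0.5)
            attn = attn.softmax(dim=-1)
            out = attn @ v                                  # (B, heads, C/heads, N)
            out = out.permute(0, 3, 1, 2).reshape(B, N, C)  # back to (B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        layer = ChannelSelfAttention(dim=64, num_heads=8)
        tokens = torch.randn(2, 196, 64)   # batch of 2, 14x14 patches, 64 channels
        print(layer(tokens).shape)          # torch.Size([2, 196, 64])

In this sketch the per-layer attention cost is O(N * C^2 / heads) rather than O(N^2 * C), which is the sense in which channel tokenization yields linear complexity in the spatial dimension.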
Pages: 9
Related Papers
50 records in total
  • [31] Accurate Facial Landmark Detector via Multi-scale Transformer
    Sha, Yuyang
    Meng, Weiyu
    Zhai, Xiaobing
    Xie, Can
    Li, Kefeng
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT V, 2024, 14429 : 278 - 290
  • [32] Fine-Grained Modulation Classification Using Multi-Scale Radio Transformer With Dual-Channel Representation
    Zheng, Qinghe
    Zhao, Penghui
    Wang, Hongjun
    Elhanashi, Abdussalam
    Saponara, Sergio
    IEEE COMMUNICATIONS LETTERS, 2022, 26 (06) : 1298 - 1302
  • [33] TCAMS-Trans: Efficient temporal-channel attention multi-scale transformer for net load forecasting
    Zhang, Qingyong
    Zhou, Shiyang
    Xu, Bingrong
    Li, Xinran
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 118
  • [34] A Multi-Scale Channel Attention Network for Prostate Segmentation
    Ding, Meiwen
    Lin, Zhiping
    Lee, Chau Hung
    Tan, Cher Heng
    Huang, Weimin
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2023, 70 (05) : 1754 - 1758
  • [35] One Model to Synthesize Them All: Multi-Contrast Multi-Scale Transformer for Missing Data Imputation
    Liu, Jiang
    Pasumarthi, Srivathsa
    Duffy, Ben
    Gong, Enhao
    Datta, Keshav
    Zaharchuk, Greg
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2023, 42 (09) : 2577 - 2591
  • [36] MAFormer: A transformer network with multi-scale attention fusion for visual recognition
    Sun, Huixin
    Wang, Yunhao
    Wang, Xiaodi
    Zhang, Bin
    Xin, Ying
    Zhang, Baochang
    Cao, Xianbin
    Ding, Errui
    Han, Shumin
    NEUROCOMPUTING, 2024, 595
  • [37] Automatic center identification of electron diffraction with multi-scale transformer networks
    Ge, Mengshu
    Pan, Yue
    Liu, Xiaozhi
    Zhao, Zhicheng
    Su, Dong
    ULTRAMICROSCOPY, 2024, 259
  • [38] Study of EEG classification of depression by multi-scale convolution combined with the Transformer
    Zhai F.-W.
    Sun F.
    Jin J.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2024, 51 (02) : 182 - 195
  • [39] Transformer-based Multi-scale Underwater Image Enhancement Network
    Yang, Ai-Ping
    Fang, Si-Jie
    Shao, Ming-Fu
    Zhang, Teng-Fei
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2024, 45 (12) : 1696 - 1705
  • [40] Hierarchical Transformer with Multi-Scale Parallel Aggregation for Breast Tumor Segmentation
    Xia, Ping
    Wang, Yudie
    Lei, Bangjun
    Peng, Cheng
    Zhang, Guangyi
    Tang, Tinglong
    LASER & OPTOELECTRONICS PROGRESS, 2025, 62 (02)