Temporal Shift Module-Based Vision Transformer Network for Action Recognition

Cited by: 1
Authors
Zhang, Kunpeng [1 ]
Lyu, Mengyan [1 ]
Guo, Xinxin [1 ]
Zhang, Liye [1 ]
Liu, Cong [1 ]
Affiliations
[1] Shandong Univ Technol, Coll Comp Sci & Technol, Zibo 255000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Computational modeling; Convolutional neural networks; Computer architecture; Task analysis; Image segmentation; Head; Action recognition; self-attention; temporal shift module; vision transformer;
DOI
10.1109/ACCESS.2024.3379885
Chinese Library Classification
TP [Automation Technology; Computer Technology];
Discipline Code
0812;
Abstract
This paper introduces a novel action recognition model named ViT-Shift, which combines the Temporal Shift Module (TSM) with the Vision Transformer (ViT) architecture. Traditional video action recognition models face significant computational challenges and require substantial computing resources; our model addresses this issue by incorporating the TSM, achieving outstanding performance while significantly reducing computational cost. Our approach applies the Transformer self-attention mechanism to video sequence processing in place of traditional convolutional methods. To preserve the core architecture of ViT and transfer its excellent image recognition performance to video action recognition, we strategically introduce the TSM only before the multi-head attention layer of ViT. This design simulates temporal interactions through channel shifts, effectively reducing computational complexity, and the position and shift parameters of the TSM are carefully chosen to maximize the model's performance. Experimental results demonstrate that ViT-Shift achieves remarkable results on two standard action recognition datasets: with ImageNet-21K pretraining, it attains an accuracy of 77.55% on Kinetics-400 and 93.07% on UCF-101.
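The abstract describes inserting a zero-parameter temporal shift along the channel dimension of the video tokens immediately before the multi-head attention layer of ViT. The PyTorch sketch below illustrates that idea only; it is a minimal sketch under my own assumptions, and the names (TemporalShift, ShiftAttentionBlock, shift_div) are illustrative, not taken from the authors' implementation.

# Minimal sketch (assumption-based), not the authors' code: a zero-parameter
# temporal shift applied to ViT tokens right before multi-head self-attention.
import torch
import torch.nn as nn

class TemporalShift(nn.Module):
    """Shift a fraction of token channels along the time axis (no extra parameters)."""
    def __init__(self, shift_div: int = 8):
        super().__init__()
        self.shift_div = shift_div

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, channels)
        b, t, n, c = x.shape
        fold = c // self.shift_div
        out = torch.zeros_like(x)
        out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                    # shift forward in time
        out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]    # shift backward in time
        out[:, :, :, 2 * fold:] = x[:, :, :, 2 * fold:]               # remaining channels unchanged
        return out

class ShiftAttentionBlock(nn.Module):
    """ViT-style sub-block: temporal shift -> LayerNorm -> multi-head self-attention."""
    def __init__(self, dim: int = 768, heads: int = 12, shift_div: int = 8):
        super().__init__()
        self.shift = TemporalShift(shift_div)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens, dim); attention runs per frame over spatial tokens,
        # while the channel shift carries information across neighboring frames.
        b, t, n, d = x.shape
        h = self.shift(x).reshape(b * t, n, d)
        h = self.norm(h)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out.reshape(b, t, n, d)   # residual connection

if __name__ == "__main__":
    clips = torch.randn(2, 8, 197, 768)           # 2 clips, 8 frames, 196 patches + CLS token
    print(ShiftAttentionBlock()(clips).shape)     # torch.Size([2, 8, 197, 768])

Because the shift itself adds no parameters or multiply-accumulate operations, temporal interaction is obtained essentially for free on top of the per-frame ViT attention, which is the computational saving the abstract refers to.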
Pages: 47246 - 47257
Number of pages: 12
Related Papers
50 records in total; showing [41]-[50]
  • [41] Temporal Pyramid Pooling Based Relation Network for Action Recognition
    Zheng, Zhenxing
    An, Gaoyun
    Ruan, Qiuqi
    PROCEEDINGS OF 2018 14TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP), 2018, : 644 - 647
  • [42] TEMPORAL ATTENTIVE NETWORK FOR ACTION RECOGNITION
    Shi, Yemin
    Tian, Yonghong
    Huang, Tiejun
    Wang, Yaowei
    2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018,
  • [43] Temporal Pyramid Network for Action Recognition
    Yang, Ceyuan
    Xu, Yinghao
    Shi, Jianping
    Dai, Bo
    Zhou, Bolei
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 588 - 597
  • [44] k-NN attention-based video vision transformer for action recognition
    Sun, Weirong
    Ma, Yujun
    Wang, Ruili
    NEUROCOMPUTING, 2024, 574
  • [45] Mixed Attention and Channel Shift Transformer for Efficient Action Recognition
    Lu, Xiusheng
    Hao, Yanbin
    Cheng, Lechao
    Zhao, Sicheng
    Li, Yutao
    Song, Mingli
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2025, 21 (03)
  • [46] Module-based visualization of large-scale graph network data
    Li, Chenhui
    Baciu, George
    Wang, Yunzhe
    JOURNAL OF VISUALIZATION, 2017, 20 : 205 - 215
  • [47] Skeleton-Based Action Recognition with Shift Graph Convolutional Network
    Cheng, Ke
    Zhang, Yifan
    He, Xiangyu
    Chen, Weihan
    Cheng, Jian
    Lu, Hanqing
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 180 - 189
  • [48] Spatial-temporal graph transformer network for skeleton-based temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Zhang, Zhao
    Liu, Peng
    Tang, Xianglong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (15) : 44273 - 44297
  • [49] Spatial-temporal graph transformer network for skeleton-based temporal action segmentation
    Tian, Xiaoyan
    Jin, Ye
    Zhang, Zhao
    Liu, Peng
    Tang, Xianglong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 : 44273 - 44297
  • [50] Module-based visualization of large-scale graph network data
    Li, Chenhui
    Baciu, George
    Wang, Yunzhe
    JOURNAL OF VISUALIZATION, 2017, 20 (02) : 205 - 215