MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Citations: 1
Authors
Huo, Hua [1]
Li, Bingjie [1]
Affiliations
[1] Henan Univ Sci & Technol, Informat Engn Coll, Luoyang 471000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
action recognition; multi-granularity multi-scale fusion; vision transformer; efficiency
DOI
10.3390/electronics13050948
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition. However, issues such as high computational cost and large memory consumption remain, and current research focuses on finding effective methods to improve model performance and overcome these limits. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for efficient action recognition in videos with reduced computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, which reduces redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. Together, these two mechanisms optimize the allocation of resources in the model, emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, comprehensive experiments are conducted on benchmark datasets in the action recognition domain. The experimental results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
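The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of selecting a subset of tokens and fusing the low-information remainder into a shorter sequence. It is not the MgMViT module. The function name `reduce_tokens`, the per-token importance `scores`, the `keep` parameter, and the score-weighted averaging rule are all assumptions made for illustration.

```python
from typing import List, Sequence

Token = List[float]


def reduce_tokens(
    tokens: Sequence[Token], scores: Sequence[float], keep: int
) -> List[Token]:
    """Keep the `keep` highest-scoring tokens and fuse the rest into one.

    Hypothetical illustration only: the scoring and fusion rules are
    assumptions, not the method described in the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by importance score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # fine tokens, in their original order
    rest_idx = order[keep:]          # low-information tokens to be fused

    kept = [list(tokens[i]) for i in kept_idx]
    if not rest_idx:
        return kept

    # Fuse the remaining tokens into a single coarse token using a
    # score-weighted average (uniform weights if all scores are zero).
    total = sum(scores[i] for i in rest_idx)
    if total > 0:
        weights = [scores[i] / total for i in rest_idx]
    else:
        weights = [1.0 / len(rest_idx)] * len(rest_idx)
    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest_idx))
        for d in range(dim)
    ]
    return kept + [fused]
```

With `keep=2` on a sequence of four tokens, the output has three tokens: the two highest-scoring ones unchanged plus one fused token. Shortening the sequence in this way is what lowers the cost of later attention layers, since self-attention cost grows quadratically with sequence length.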
Pages: 16