MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Cited by: 1
Authors
Huo, Hua [1]
Li, Bingjie [1]
Affiliations
[1] Henan University of Science and Technology, Information Engineering College, Luoyang 471000, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
action recognition; multi-granularity multi-scale fusion; vision transformer; efficiency
DOI
10.3390/electronics13050948
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress on static images, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition, but issues such as high computational cost and large memory consumption remain. Current research therefore focuses on effective methods that improve model performance while overcoming these limitations. We present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed for action recognition in videos, that effectively reduces computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at multiple scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, reducing redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with low information content. These two mechanisms are combined to optimize the allocation of resources in the model, further emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, comprehensive experiments are conducted on benchmark action recognition datasets. The results demonstrate that our method achieves state-of-the-art performance in terms of accuracy and efficiency.
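The abstract describes two complementary token-reduction ideas: selecting which tokens proceed to the next computational step, and fusing low-information tokens in a coarse-fine granularity fusion layer. The following PyTorch sketch illustrates the general mechanism under stated assumptions; the function name coarse_fine_fusion, the keep_ratio parameter, and the score-weighted fusion rule are illustrative choices, not the authors' published MgMViT implementation.

```python
# Hedged sketch (not the paper's exact module): keep the most informative
# "fine" tokens and fuse the remaining low-information tokens into a
# single "coarse" token, shortening the sequence for the next block.
import torch


def coarse_fine_fusion(tokens: torch.Tensor,
                       scores: torch.Tensor,
                       keep_ratio: float = 0.7) -> torch.Tensor:
    """tokens: (B, N, C) patch tokens; scores: (B, N) per-token importance
    (e.g., attention received from the class token). Returns a sequence of
    roughly keep_ratio * N fine tokens plus one fused coarse token."""
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    if n_keep >= N:                       # nothing left to fuse
        return tokens

    # Fine granularity: the highest-scoring tokens survive unchanged.
    keep_idx = scores.topk(n_keep, dim=1).indices                    # (B, n_keep)
    fine = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # Coarse granularity: score-weighted average of the dropped tokens.
    kept_mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    kept_mask.scatter_(1, keep_idx, True)
    drop_weights = scores.masked_fill(kept_mask, float("-inf")).softmax(dim=1)
    fused = (drop_weights.unsqueeze(-1) * tokens).sum(dim=1, keepdim=True)

    return torch.cat([fine, fused], dim=1)                           # (B, n_keep + 1, C)


# Example: one clip with 196 patch tokens of dimension 768.
x = torch.randn(1, 196, 768)
s = torch.rand(1, 196)           # stand-in importance scores
out = coarse_fine_fusion(x, s)   # shape (1, 138, 768): 137 fine + 1 fused
```

A layer of this form shortens the token sequence fed to subsequent Transformer blocks, which is how the abstract's claimed savings in computation and memory would arise.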
Pages: 16