MgMViT: Multi-Granularity and Multi-Scale Vision Transformer for Efficient Action Recognition

Cited by: 1

Authors:

Huo, Hua [1]

Li, Bingjie [1]

Affiliations:

[1] Henan Univ Sci & Technol, Informat Engn Coll, Luoyang 471000, Peoples R China

Source:

ELECTRONICS | 2024 / Vol. 13 / Issue 05

Funding:

National Natural Science Foundation of China;

Keywords:

action recognition; multi-granularity multi-scale fusion; vision transformer; efficiency;

DOI:

10.3390/electronics13050948

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Video-based action recognition is developing rapidly. Although Vision Transformers (ViTs) have made great progress in static image processing, they are not yet fully optimized for dynamic video applications. Convolutional Neural Networks (CNNs) and related models perform exceptionally well in video action recognition. However, high computational cost and large memory consumption remain issues that cannot be ignored, and current research focuses on methods that improve model performance while overcoming these limitations. We therefore present a Vision Transformer model based on multi-granularity and multi-scale fusion, designed to recognize actions in videos efficiently while reducing computational cost and memory usage. First, we devise a multi-scale, multi-granularity module that integrates with Transformer blocks. Second, a hierarchical structure manages information at various scales, and multi-granularity is introduced on top of multi-scale so that the number of tokens entering the next computational step can be chosen selectively, thereby reducing redundant tokens. Third, a coarse-fine granularity fusion layer reduces the sequence length of tokens with lower information content. These two mechanisms are combined to optimize the allocation of resources in the model, further emphasizing critical information and reducing redundancy, thereby minimizing computational cost. To assess the proposed approach, comprehensive experiments are conducted on benchmark datasets in the action recognition domain. The experimental results demonstrate that the method achieves state-of-the-art performance in terms of accuracy and efficiency.
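
The abstract does not say how tokens are selected or fused, so the following is only a minimal sketch of the general idea of token reduction, not the method from the paper. It assumes each token already has an importance score (how such a score is computed is left open), keeps the highest-scoring tokens at fine granularity, and averages the remaining tokens into one coarse token. The function name `reduce_tokens`, the `keep` parameter, and the mean-based fusion are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

Token = List[float]


def reduce_tokens(tokens: Sequence[Token], scores: Sequence[float], keep: int) -> List[Token]:
    """Keep the `keep` highest-scoring tokens and fuse the rest into one token.

    Hypothetical illustration only: the scoring rule and the mean-based
    fusion are assumptions, not the method described in the paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")

    # Rank token indices by importance score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # kept tokens stay in their original order
    rest_idx = order[keep:]

    reduced = [list(tokens[i]) for i in kept_idx]
    if rest_idx:
        dim = len(tokens[0])
        # Coarse token: element-wise mean of the low-scoring tokens.
        fused = [sum(tokens[i][d] for i in rest_idx) / len(rest_idx) for d in range(dim)]
        reduced.append(fused)
    return reduced


# Four 2-dimensional tokens; the two with the highest scores are kept as-is.
tokens = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [2.0, 6.0]]
scores = [0.9, 0.1, 0.8, 0.3]
print(reduce_tokens(tokens, scores, keep=2))
# → [[1.0, 0.0], [4.0, 4.0], [1.0, 4.0]]
```

In this toy example the sequence shrinks from four tokens to three. Since self-attention cost grows with the square of the sequence length, shortening the sequence in this way is one plausible source of the reduced computation the abstract reports.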


Pages: 16

50 items in total

[1] Multi-granularity transformer fusion for temporal action localization
Zhang M.
Hu H.
Li Z.
Soft Computing, 2024, 28 (20) : 12377 - 12388
[2] Data-efficient multi-scale fusion vision transformer
Tang, Hao
Liu, Dawei
Shen, Chengchao
PATTERN RECOGNITION, 2025, 161
[3] DeepFake detection with multi-scale convolution and vision transformer
Lin, Hao
Huang, Wenmin
Luo, Weiqi
Lu, Wei
DIGITAL SIGNAL PROCESSING, 2023, 134
[4] DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
Jiao, Jiayu
Tang, Yu-Ming
Lin, Kun-Yu
Gao, Yipeng
Ma, Andy J.
Wang, Yaowei
Zheng, Wei-Shi
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8906 - 8919
[5] MSAPVT: a multi-scale attention pyramid vision transformer network for large-scale fruit recognition
Rao, Yao
Li, Chaofeng
Xu, Feiran
Guo, Ya
JOURNAL OF FOOD MEASUREMENT AND CHARACTERIZATION, 2024, 18 (11) : 9233 - 9251
[6] A Multi-Scale Video Longformer Network for Action Recognition
Chen, Congping
Zhang, Chunsheng
Dong, Xin
APPLIED SCIENCES-BASEL, 2024, 14 (03)
[7] Evolution modeling with multi-scale smoothing for action recognition
Wang, Tingwei
Liu, Chuancai
Wang, Liantao
Ma, Bingxian
Gu, Xingjian
JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2018, 55 : 778 - 788
[8] Hierarchical Multi-scale Attention Networks for action recognition
Yan, Shiyang
Smith, Jeremy S.
Lu, Wenjin
Zhang, Bailing
SIGNAL PROCESSING-IMAGE COMMUNICATION, 2018, 61 : 73 - 84
[9] Multi-Scale Region Candidate Combination for Action Recognition
Zhao, Zhichen
Ma, Huimin
Chen, Xiaozhi
2016 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2016 : 3071 - 3075
[10] MUP: Multi-granularity Unified Perception for Panoramic Activity Recognition
Cao, Meiqi
Yan, Rui
Shu, Xiangbo
Zhang, Jiachao
Wang, Jinpeng
Xie, Guo-Sen
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 7666 - 7675
