Enhancing Human Action Recognition with Fine-grained Body Movement Attention

Times cited: 0
Authors
Zhang, Rui [1]
Xue, Junxiao [2]
Lin, Feng [3]
Zhang, Qing [3]
Smirnov, Pavel [4]
Ma, Xiao [3]
Yan, Xiaoran [1]
Affiliations
[1] Zhejiang Lab, Res Ctr Data Hub & Secur, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Lab, Res Ctr Space Based Comp Syst, Hangzhou, Zhejiang, Peoples R China
[3] Zhejiang Lab, Res Ctr Frontier Fundamental Studies, Hangzhou, Zhejiang, Peoples R China
[4] Zhejiang Lab, Res Ctr Astron Comp, Hangzhou, Zhejiang, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME 2024 | 2024
Keywords
Action Recognition; Multi-modal Learning; Contrastive Learning; Vision-Language Model
DOI
10.1109/ICME57554.2024.10688034
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the field of vision-language models (VLMs), human action recognition models, while effective, always rely on large pre-trained models or high-resolution inputs, leading to computational challenges. To address this, we propose a novel VLM approach with fine-grained attention to body movements. Unlike methods relying on coarse video-text matching, we guide the model to infer actions from fine-grained body part movements using two techniques: fine-tuning pre-trained encoders at the fine-grained level and matching labels from language and vision perspectives at the coarse-grained level. Experiments show our model excels in fully-supervised, few-shot, and zero-shot scenarios with just 8 random frames and a ViT-B/32 backbone. It outperforms most ViT-L/14 based models, demonstrating effectiveness while saving computational resources. The largest Top-1 accuracy improvement over second-best approaches is 6.8%.
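The abstract mentions two ingredients that a small example can make concrete: sampling 8 random frames per video and coarse video-text matching against label text. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes uniform random sampling without replacement, mean pooling of per-frame embeddings, and cosine similarity as the matching score (a common choice in CLIP-style models); the function names and toy embeddings are hypothetical, and the paper's fine-grained body-part attention is not modeled here.

```python
import math
import random


def sample_frame_indices(num_frames, num_samples=8, seed=None):
    """Pick distinct frame indices uniformly at random, returned in temporal order.

    Assumption: the abstract only says "8 random frames"; uniform sampling
    without replacement is one plausible reading.
    """
    if num_frames < num_samples:
        raise ValueError("video has fewer frames than requested samples")
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), num_samples))


def mean_pool(frame_embeddings):
    """Average per-frame embedding vectors into one video-level vector."""
    count = len(frame_embeddings)
    return [sum(column) / count for column in zip(*frame_embeddings)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_label(frame_embeddings, label_embeddings):
    """Return the label whose text embedding is closest to the pooled video embedding.

    label_embeddings maps a label string to its (hypothetical) text embedding.
    """
    video = mean_pool(frame_embeddings)
    scores = {label: cosine(video, emb) for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy usage with made-up 2-D embeddings standing in for encoder outputs.
indices = sample_frame_indices(num_frames=120, num_samples=8, seed=0)
toy_frames = [[1.0, 0.0], [0.8, 0.2]]
toy_labels = {"running": [1.0, 0.0], "sitting": [0.0, 1.0]}
best_label, label_scores = match_label(toy_frames, toy_labels)
```

In a real pipeline the toy vectors would be replaced by outputs of the vision and text encoders (the abstract names a ViT-B/32 backbone); the matching step itself stays the same.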
Pages: 6