Enhancing Human Action Recognition with Fine-grained Body Movement Attention

Cited by: 0

Authors:

Zhang, Rui [1]

Xue, Junxiao [2]

Lin, Feng [3]

Zhang, Qing [3]

Smirnov, Pavel [4]

Ma, Xiao [3]

Yan, Xiaoran [1]

Affiliations:

[1] Zhejiang Lab, Res Ctr Data Hub & Secur, Hangzhou, Zhejiang, Peoples R China

[2] Zhejiang Lab, Res Ctr Space Based Comp Syst, Hangzhou, Zhejiang, Peoples R China

[3] Zhejiang Lab, Res Ctr Frontier Fundamental Studies, Hangzhou, Zhejiang, Peoples R China

[4] Zhejiang Lab, Res Ctr Astron Comp, Hangzhou, Zhejiang, Peoples R China

Source:

2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME 2024 | 2024

Keywords:

Action Recognition; Multi-modal Learning; Contrastive Learning; Vision-Language Model

DOI:

10.1109/ICME57554.2024.10688034

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In the field of vision-language models (VLMs), human action recognition models, while effective, always rely on large pre-trained models or high-resolution inputs, leading to computational challenges. To address this, we propose a novel VLM approach with fine-grained attention to body movements. Unlike methods relying on coarse video-text matching, we guide the model to infer actions from fine-grained body part movements using two techniques: fine-tuning pre-trained encoders at the fine-grained level and matching labels from language and vision perspectives at the coarse-grained level. Experiments show our model excels in fully-supervised, few-shot, and zero-shot scenarios with just 8 random frames and a ViT-B/32 backbone. It outperforms most ViT-L/14 based models, demonstrating effectiveness while saving computational resources. The largest Top-1 accuracy improvement over second-best approaches is 6.8%.


Pages: 6
