Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited by: 1

Authors:

Kim, Hyun-Woo ^{[1]}

Choi, Yong-Suk ^{[2]}

Affiliations:

[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea

[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea

Source:

SENSORS | 2024 / Vol. 24 / Issue 21

Funding:

National Research Foundation of Singapore;

Keywords:

action recognition; fusion attention; temporal redundancy;

DOI:

10.3390/s24216842

Chinese Library Classification (CLC) number:

O65 [Analytical Chemistry];

Discipline classification codes:

070302 ; 081704 ;

Abstract:

Conventional approaches to video action recognition perform global attention over all video patches, which may be ineffective because video frames are temporally redundant. Recent works on masked video modeling adopt a high-ratio tube masking and reconstruction strategy as a pre-training method, which mitigates the tendency of models to capture spatial features well while neglecting temporal features. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention to fuse global and sparse-dense attention. The sparse-dense attention operates on groups of tube-shaped patches to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier uses spatio-temporal features from the different patch groups to improve performance. The proposed method uses weight parameters pre-trained with VideoMAE and MVD, and achieves higher performance (+0.1% to +0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Moreover, qualitative comparisons show that FAR captures temporal features well in highly redundant video frames. The FAR approach demonstrates improved action recognition with efficient computation, and exploring its adaptability across different pre-training methods is an interesting direction for future research.
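The head-split idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how patches are assigned to groups, how many heads are global, or how queries, keys, and values are projected. The sketch assumes, purely for illustration, that spatial position `s` belongs to group `s % G` across all frames (one possible "tube" grouping), that the first `num_global_heads` heads attend globally, and that each head uses its slice of the input directly as query, key, and value. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes (n, d).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def tube_group_ids(T, S, G):
    # ASSUMPTION: spatial position s is assigned to group s % G, and all T
    # frames of that position share the group, so each group is a set of tubes.
    # Token index convention: token = t * S + s.
    return np.tile(np.arange(S) % G, T)

def head_split_attention(x, T, S, num_heads, num_global_heads, G):
    # x: (N, D) token features with N = T * S.
    N, D = x.shape
    d = D // num_heads
    heads = x.reshape(N, num_heads, d).transpose(1, 0, 2)  # (H, N, d)
    out = np.empty_like(heads)

    # Global heads: every token attends to every other token.
    g = heads[:num_global_heads]
    out[:num_global_heads] = attention(g, g, g)

    # Remaining heads: tokens attend only within their own tube group.
    ids = tube_group_ids(T, S, G)
    for grp in range(G):
        idx = np.where(ids == grp)[0]
        h = heads[num_global_heads:, idx]  # (H - num_global_heads, n_grp, d)
        out[num_global_heads:, idx] = attention(h, h, h)

    # Concatenate the heads back into (N, D).
    return out.transpose(1, 0, 2).reshape(N, D)

# Example: 2 frames, 4 spatial positions, 4 heads (2 global), 2 groups.
rng = np.random.default_rng(0)
x = rng.normal(size=(2 * 4, 8))
y = head_split_attention(x, T=2, S=4, num_heads=4, num_global_heads=2, G=2)
```

Under these assumptions, each grouped head computes attention over roughly N/G tokens per group instead of N, which is one way restricting some heads to groups could reduce computation relative to fully global attention.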

Pages: 18
