Fusion Attention for Action Recognition: Integrating Sparse-Dense and Global Attention for Video Action Recognition

Cited by: 0
Authors
Kim, Hyun-Woo [1 ]
Choi, Yong-Suk [2 ]
Affiliations
[1] Hanyang Univ, Dept Artificial Intelligence Applicat, Seoul 04763, South Korea
[2] Hanyang Univ, Dept Comp Sci & Engn, Seoul 04763, South Korea
Funding
National Research Foundation of Singapore
Keywords
action recognition; fusion attention; temporal redundancy
DOI
10.3390/s24216842
CLC number
O65 [Analytical Chemistry]
Subject classification codes
070302; 081704
Abstract
Conventional approaches to video action recognition perform global attention over all video patches, which may be ineffective due to the temporal redundancy of video frames. Recent works on masked video modeling adopt a high-ratio tube masking and reconstruction strategy as a pre-training method to mitigate the problem that models capture spatial features well but temporal features poorly. Inspired by this pre-training method, we propose Fusion Attention for Action Recognition (FAR), which fuses sparse-dense attention patterns specialized for temporal features with global attention during fine-tuning. FAR has three main components: head-split sparse-dense attention (HSDA), token-group interaction, and a group-averaged classifier. First, HSDA splits the heads of multi-head self-attention to fuse global and sparse-dense attention. The sparse-dense attention is divided into groups of tube-shaped patches to focus on temporal features. Second, token-group interaction improves information exchange between the divided patch groups. Finally, the group-averaged classifier uses spatio-temporal features from different patch groups to improve performance. The proposed method uses weight parameters pre-trained with VideoMAE and MVD, and achieves higher performance (+0.1-0.4%) with less computation than models fine-tuned with global attention on Something-Something V2 and Kinetics-400. Moreover, qualitative comparisons show that FAR captures temporal features well even in highly redundant video frames. The FAR approach demonstrates improved action recognition with efficient computation, and exploring its adaptability across different pre-training methods presents an interesting direction for future research.
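To make the head-split idea concrete, below is a minimal sketch of how HSDA could split attention heads between a global pattern and a tube-local pattern, as the abstract describes. The 50/50 head split, the frame-major token ordering, and the shared query/key/value tensor are illustrative assumptions for brevity, not the paper's exact design.

```python
# Minimal sketch of head-split sparse-dense attention (HSDA): some heads
# attend globally over all video patch tokens, while the remaining heads
# attend only within tube-shaped groups (tokens sharing a spatial position
# across time). Assumption: tokens are ordered frame-major, N = t * hw.
import torch
import torch.nn.functional as F


def head_split_attention(x, num_heads, t, hw, global_heads):
    """x: (B, N, D) with N = t * hw tube-patch tokens."""
    B, N, D = x.shape
    d = D // num_heads
    # Toy projection: reuse x as query, key, and value for brevity.
    qkv = x.view(B, N, num_heads, d).transpose(1, 2)  # (B, heads, N, d)

    # Global heads: ordinary self-attention over all N tokens.
    g = F.scaled_dot_product_attention(
        qkv[:, :global_heads], qkv[:, :global_heads], qkv[:, :global_heads]
    )

    # Tube heads: fold the spatial axis into the batch so attention runs
    # only along time within each spatial position (a tube of t tokens).
    local_heads = num_heads - global_heads
    z = qkv[:, global_heads:].reshape(B, local_heads, t, hw, d)
    z = z.permute(0, 3, 1, 2, 4).reshape(B * hw, local_heads, t, d)
    l = F.scaled_dot_product_attention(z, z, z)
    l = l.reshape(B, hw, local_heads, t, d).permute(0, 2, 3, 1, 4)
    l = l.reshape(B, local_heads, N, d)

    # Fuse the two attention patterns by concatenating heads back together.
    return torch.cat([g, l], dim=1).transpose(1, 2).reshape(B, N, D)


# Example: 8 frames of 14x14 patches, 12 heads, half global / half tube-local.
x = torch.randn(2, 8 * 14 * 14, 768)
out = head_split_attention(x, num_heads=12, t=8, hw=196, global_heads=6)
print(out.shape)  # torch.Size([2, 1568, 768])
```

Folding the spatial positions into the batch dimension means each tube-local head attends over only t tokens instead of N, which is consistent with the abstract's claim that fusing sparse-dense attention with global attention reduces computation relative to fully global fine-tuning.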
Pages: 18