Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited by: 0
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder;
DOI
10.1109/ACCESS.2024.3457024
CLC Classification Number
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in exploiting spatiotemporal features to generate appropriate scene representations. Previous methods, however, either rely on complex frameworks that require individual action labels or lack adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction, and we show how this methodology effectively captures task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) generating target-specific spatiotemporal features yields favourable outcomes across various datasets; and 3) the method remains effective even for datasets with a small number of videos, highlighting its capability with limited training data. Furthermore, existing GAR datasets contain few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce IITP Hostage, a crime-activity dataset of 923 videos with two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves an MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD datasets, respectively, and 84.3%, 95.6%, and 96.78% on IITP Hostage, hostage+CAD, and a subset of the UCF Crime dataset. The hostage and non-hostage scenarios introduce additional complexity, making these activities harder for the model to recognize accurately than those in hostage+CAD and the other datasets. This observation underscores the need to delve deeper into the complexity of GAR activities.
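The abstract's central idea, randomly masking spatio-temporal patch tokens of a video clip and learning scene representations by reconstructing the masked portion, follows the general masked-autoencoder recipe. The sketch below is only a minimal illustration of that recipe in PyTorch, assuming patch tokens have already been extracted from a clip; the class name TinyVideoMAE, the 0.75 mask ratio, and all layer sizes are hypothetical choices and are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): masked-autoencoder-style
# reconstruction over spatio-temporal patch tokens, assuming PyTorch.
import torch
import torch.nn as nn


class TinyVideoMAE(nn.Module):
    """Minimal masked autoencoder over flattened spatio-temporal patch tokens.

    All hyperparameters (dim, depth, mask_ratio) are illustrative assumptions,
    not values reported in the paper.
    """

    def __init__(self, num_tokens=196, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)  # reconstruct token embeddings

    def forward(self, tokens):
        # tokens: (B, N, D) spatio-temporal patch embeddings of one clip
        B, N, D = tokens.shape
        x = tokens + self.pos
        keep = int(N * (1.0 - self.mask_ratio))
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)  # random token order
        keep_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                           # encode visible tokens only
        # re-insert mask tokens at the masked positions, then decode
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), latent)
        recon = self.head(self.decoder(full + self.pos))
        # reconstruction loss is computed on the masked positions only
        target = torch.gather(tokens, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = TinyVideoMAE()
    clip_tokens = torch.randn(2, 196, 128)  # e.g. 2 clips, 196 patch tokens each
    print(model(clip_tokens).item())        # scalar reconstruction loss
```

After this reconstruction-based pretraining stage, the encoder's clip-level representation would typically be fed to a lightweight classification head for the group-activity labels, which is consistent with the abstract's point that no individual action labels are needed.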
Pages: 132084 - 132095
Page count: 12