Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition

Cited by: 0
Authors
Yadav, Rajeshwar [1 ]
Halder, Raju [1 ]
Banda, Gourinath [2 ]
Affiliations
[1] Indian Inst Technol Patna, Dept Comp Sci & Engn, Bihta 801106, India
[2] Indian Inst Technol Indore, Dept Comp Sci & Engn, Indore 453552, Madhya Pradesh, India
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Group activity recognition (GAR); hostage crime; IITP hostage dataset; spatial and temporal interaction; vision transformer; masked autoencoder;
DOI
10.1109/ACCESS.2024.3457024
CLC Classification Number
TP [Automation technology, computer technology];
Discipline Classification Code
0812;
Abstract
Group Activity Recognition (GAR) is a challenging problem involving several intricacies. The core of GAR lies in exploiting spatiotemporal features to generate appropriate scene representations. Previous methods, however, either rely on complex frameworks that require individual action labels or lack adequate modelling of spatial and temporal features. To address these concerns, we propose a masking strategy for learning task-specific GAR scene representations through reconstruction, and we show how this methodology effectively captures task-specific spatiotemporal features. In particular, three notable findings emerge from our framework: 1) GAR is simplified, eliminating the need for individual action labels; 2) generating target-specific spatiotemporal features yields favourable outcomes across various datasets; and 3) the method remains effective even for datasets with a small number of videos, highlighting its capability with limited training data. Furthermore, existing GAR datasets contain few videos per class and consider only a few actors, which restricts how well existing models generalise. To this end, we introduce IITP Hostage, a crime-activity dataset of 923 videos with two categories, hostage and non-hostage. To our knowledge, this is the first attempt to recognize crime-based activities in GAR. Our framework achieves an MCA of 96.8%, 97.0%, and 97.0% on the Collective Activity Dataset (CAD), new CAD, and extended CAD datasets, respectively, and 84.3%, 95.6%, and 96.78% on IITP Hostage, hostage+CAD, and a subset of the UCF Crime dataset. The hostage and non-hostage scenarios introduce additional complexity, making these activities harder for the model to recognize accurately than those in hostage+CAD and the other datasets. This observation underscores the need to delve deeper into the complexity of GAR activities.
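The abstract's central idea, randomly masking spatio-temporal patch tokens of a video clip and learning scene representations by reconstructing the masked portion, follows the general masked-autoencoder recipe. The sketch below is only a minimal illustration of that recipe in PyTorch, assuming patch tokens have already been extracted from a clip; the class name TinyVideoMAE, the 0.75 mask ratio, and all layer sizes are hypothetical choices and are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): masked-autoencoder-style
# reconstruction over spatio-temporal patch tokens, assuming PyTorch.
import torch
import torch.nn as nn


class TinyVideoMAE(nn.Module):
    """Minimal masked autoencoder over flattened spatio-temporal patch tokens.

    All hyperparameters (dim, depth, mask_ratio) are illustrative assumptions,
    not values reported in the paper.
    """

    def __init__(self, num_tokens=196, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, dim)  # reconstruct token embeddings

    def forward(self, tokens):
        # tokens: (B, N, D) spatio-temporal patch embeddings of one clip
        B, N, D = tokens.shape
        x = tokens + self.pos
        keep = int(N * (1.0 - self.mask_ratio))
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)  # random token order
        keep_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                           # encode visible tokens only
        # re-insert mask tokens at the masked positions, then decode
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, D), latent)
        recon = self.head(self.decoder(full + self.pos))
        # reconstruction loss is computed on the masked positions only
        target = torch.gather(tokens, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        pred = torch.gather(recon, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        return nn.functional.mse_loss(pred, target)


if __name__ == "__main__":
    model = TinyVideoMAE()
    clip_tokens = torch.randn(2, 196, 128)  # e.g. 2 clips, 196 patch tokens each
    print(model(clip_tokens).item())        # scalar reconstruction loss
```

After this reconstruction-based pretraining stage, the encoder's clip-level representation would typically be fed to a lightweight classification head for the group-activity labels, which is consistent with the abstract's point that no individual action labels are needed.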
Pages: 132084 - 132095
Page count: 12