A survey on deep learning-based spatio-temporal action detection

Cited by: 1
Authors
Wang, Peng [1 ]
Zeng, Fanwei [2 ]
Qian, Yuntao [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou 310007, Zhejiang, Peoples R China
[2] Ant Grp, Hangzhou 310007, Zhejiang, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Computer vision; deep learning; spatio-temporal action detection; SEARCH;
DOI
10.1142/S0219691323500662
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in both space and time. It has become a particularly active area of research in computer vision because of its rapidly emerging real-world applications, such as autonomous driving, visual surveillance and entertainment. In recent years, many efforts have been devoted to building robust and effective frameworks for STAD. This paper provides a comprehensive review of state-of-the-art deep learning-based methods for STAD. First, a taxonomy is developed to organize these methods. Next, the linking algorithms, which associate frame- or clip-level detection results into action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. Finally, the paper concludes with a discussion of potential research directions for STAD.
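To illustrate the linking step mentioned in the abstract, the sketch below shows one minimal greedy strategy for chaining per-frame detections into an action tube: at each frame, pick the candidate box that best combines overlap (IoU) with the previous frame's box and its own confidence score. This snippet is not drawn from the survey; the function names (`iou`, `link_tube`) and the linking criterion are illustrative assumptions, and published linking algorithms (e.g., dynamic-programming formulations over whole videos) are typically more elaborate.

```python
# Illustrative sketch (not from the survey): greedy linking of per-frame
# action detections into a single action tube for one action class.
# All names and the scoring criterion here are hypothetical.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_tube(frame_detections, iou_weight=1.0, score_weight=1.0):
    """Greedily grow one action tube across frames.

    frame_detections: list over frames; each frame is a list of
    (box, score) pairs for a single action class.
    Returns the linked tube as a list of (box, score), one per frame.
    """
    # Start from the highest-scoring detection in the first frame.
    tube = [max(frame_detections[0], key=lambda d: d[1])]
    for detections in frame_detections[1:]:
        prev_box = tube[-1][0]
        # Choose the detection that best balances overlap with the
        # previous box and its own confidence.
        best = max(
            detections,
            key=lambda d: iou_weight * iou(prev_box, d[0]) + score_weight * d[1],
        )
        tube.append(best)
    return tube

# Toy usage: three frames, two candidate boxes per frame.
frames = [
    [((10, 10, 50, 80), 0.9), ((200, 40, 240, 90), 0.3)],
    [((12, 11, 52, 82), 0.8), ((198, 42, 242, 92), 0.4)],
    [((15, 12, 55, 84), 0.85), ((195, 45, 245, 95), 0.2)],
]
print(link_tube(frames))
```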
Pages: 35