Video action detection by learning graph-based spatio-temporal interactions

Cited by: 13
Authors
Tomei, Matteo [1]
Baraldi, Lorenzo [1]
Calderara, Simone [1, 2]
Bronzin, Simone [2]
Cucchiara, Rita [1]
Affiliations
[1] Univ Modena & Reggio Emilia, Via Pietro Vivarelli 10, I-41125 Modena, Italy
[2] METALIQUID SRL, Via Giosue Carducci 26, I-20123 Milan, Italy
Keywords
Video understanding; Action detection; Graph learning
DOI
10.1016/j.cviu.2021.103187
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Action detection is a complex task that aims to detect and classify human actions in video clips. It has typically been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, greater attention has been given to relationship modeling. Following this line, we propose a graph-based framework to learn high-level interactions between people and objects in both space and time. In our formulation, spatio-temporal relationships are learned through self-attention on a multi-layer graph structure that can connect entities from consecutive clips, thus capturing long-range spatial and temporal dependencies. The proposed module is backbone independent by design and does not require end-to-end training. Extensive experiments are conducted on the AVA dataset, where our model demonstrates state-of-the-art results and consistent improvements over baselines built with different backbones. Code is publicly available at https://github.com/aimagelab/STAGE_action_detection.
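The abstract's central idea, self-attention restricted to the edges of a graph whose nodes are people and objects, can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that): the function name `graph_self_attention`, the plain dot-product scoring without learned projections, the residual update, and the edge rule are all assumptions made for illustration.

```python
import math


def graph_self_attention(features, adjacency):
    """One round of masked dot-product self-attention over graph nodes.

    features:  list of N feature vectors (lists of floats, equal length).
    adjacency: N x N list of booleans; adjacency[i][j] is True when node i
               may attend to node j (the edge rule is a caller's assumption).
    Returns a new list of N vectors: each input plus its attended neighbors.
    No learned weights are used; a trained model would add projections.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    updated = []
    for i, query in enumerate(features):
        neighbors = [j for j in range(len(features)) if adjacency[i][j]]
        if not neighbors:
            # An isolated node keeps its original features.
            updated.append(list(query))
            continue
        # Scaled dot-product scores against each connected node.
        scores = [
            sum(q * k for q, k in zip(query, features[j])) / scale
            for j in neighbors
        ]
        # Numerically stable softmax over the neighborhood only.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of neighbor features, added back as a residual.
        attended = [
            sum(w * features[j][d] for w, j in zip(weights, neighbors))
            for d in range(dim)
        ]
        updated.append([q + a for q, a in zip(query, attended)])
    return updated


# Hypothetical toy graph: nodes 0 and 1 are in one clip, node 2 is isolated.
toy_features = [[1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
toy_adjacency = [
    [False, True, False],   # node 0 attends only to node 1
    [True, True, False],    # node 1 attends to node 0 and to itself
    [False, False, False],  # node 2 has no edges
]
toy_output = graph_self_attention(toy_features, toy_adjacency)
```

Calling the function repeatedly on its own output stacks layers, so information can travel across several edges; connecting nodes that belong to consecutive clips in `adjacency` is one way to let temporal context propagate in the manner the abstract describes.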
Pages: 9