SPATIO-TEMPORAL MOTION AGGREGATION NETWORK FOR VIDEO ACTION DETECTION

Times cited: 3
Authors
Zhang, Hongcheng [1]
Zhao, Xu [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Automat, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
video understanding; video action detection; spatio-temporal action detection; anchor-free detector;
DOI
10.1109/ICASSP43922.2022.9746817
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Recognizing action patterns and detecting action instances are both vital to spatio-temporal action detection, which aims to recognize the actions of interest in untrimmed videos and localize them in both space and time. Mainstream action tubelet detectors, however, ignore the conflict between the features needed for localization and those needed for classification, and use localization features for temporal modeling, which leads to ineffective action classification. In this paper, we propose the Spatio-Temporal Motion Aggregation mechanism, which integrates the local motion features of a short-term snippet with longer-range spatio-temporal information to predict the action category. We design the Class-Agnostic Center Localization module to localize action instance centers in a class-agnostic manner. In addition, we propose Movement and Size Regression, which estimates movement and spatial extent and uses Gaussian kernels to encode training samples. These three modules work together to generate tubelet detections, which can be further linked into video-level tubes with a matching strategy. Our detector achieves state-of-the-art performance in both frame-mAP and video-mAP on the UCF-24 and JHMDB datasets.
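The abstract mentions Gaussian kernels for encoding training samples but does not give the formula. As a rough illustration only, the sketch below shows one common way anchor-free detectors turn a ground-truth center point into a soft training target: a 2D Gaussian that peaks at the center. The function name `gaussian_heatmap`, the grid size, and the fixed `sigma` are all assumptions made for this example, not details taken from the paper.

```python
import math

def gaussian_heatmap(height, width, center, sigma):
    """Build a height x width grid that peaks at 1.0 on `center` (x, y).

    Hypothetical helper for illustration; the paper's actual encoding
    (kernel shape, how sigma is chosen) is not specified in the abstract.
    """
    cx, cy = center
    return [
        [
            # Unnormalized isotropic Gaussian of the squared distance to the center.
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]

# Encode a single center at pixel (2, 2) on a 5 x 5 grid.
heatmap = gaussian_heatmap(height=5, width=5, center=(2, 2), sigma=1.0)
```

The target is 1.0 exactly at the center and decays smoothly with distance, so locations near the annotated center are penalized less than distant ones during training.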
Pages: 2180-2184
Number of pages: 5