Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video

Cited by: 0
Authors
Wu, Jie [1, 3]
Zhang, Wei [2]
Li, Guanbin [1]
Wu, Wenhao [2]
Tan, Xiao [2]
Li, Yingying [2]
Ding, Errui [2]
Lin, Liang [1]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Peoples R China
[2] Baidu Inc, Beijing, Peoples R China
[3] ByteDance Inc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021 | 2021
Funding
National Natural Science Foundation of China
Keywords
LOCALIZATION
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network that takes as input proposals of multiple granularities in both the spatial and temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which provides rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos with spatio-temporal abnormal annotations, to serve as benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to handling this task.
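To make the notion of a spatio-temporal tube in the abstract concrete, the following is a minimal sketch of one way to represent a tube and compare a predicted tube against an annotated one. The `Tube` class, the `box_iou` and `tube_iou` helpers, and the particular overlap definition (per-frame IoU summed over shared frames, divided by the number of frames in the temporal union) are assumptions made for illustration; the record does not state which representation or evaluation metric the paper uses.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# (x1, y1, x2, y2) in pixels; an assumed convention, not taken from the paper.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical container: one bounding box per frame index."""
    boxes: Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(pred: Tube, gt: Tube) -> float:
    """Assumed spatio-temporal overlap: per-frame IoU summed over the
    frames both tubes cover, divided by the number of frames either covers."""
    shared = pred.boxes.keys() & gt.boxes.keys()
    covered = pred.boxes.keys() | gt.boxes.keys()
    if not covered:
        return 0.0
    return sum(box_iou(pred.boxes[t], gt.boxes[t]) for t in shared) / len(covered)


# Illustrative data only: the tubes share frames 2 and 3 out of six covered frames.
gt = Tube({t: (0.0, 0.0, 10.0, 10.0) for t in range(0, 4)})
pred = Tube({t: (0.0, 0.0, 10.0, 10.0) for t in range(2, 6)})
score = tube_iou(pred, gt)
```

Under this definition a prediction is penalized both for drifting spatially within a frame and for starting or ending at the wrong time, which matches the two things a tube has to get right.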
Pages: 1172-1178
Page count: 7