SSMTL++: Revisiting self-supervised multi-task learning for video anomaly detection

Cited by: 50
Authors
Barbalau, Antonio [1]
Ionescu, Radu Tudor [1, 2, 3]
Georgescu, Mariana-Iuliana [1, 2]
Dueholm, Jacob [4, 5]
Ramachandra, Bharathkumar [6]
Nasrollahi, Kamal [4, 5]
Khan, Fahad Shahbaz [3, 7]
Moeslund, Thomas B. [4]
Shah, Mubarak [8]
Affiliations
[1] Univ Bucharest, Dept Comp Sci, 14 Acad St, Bucharest 010014, Romania
[2] SecurifAI, 21D Mircea Voda, Bucharest 030662, Romania
[3] MBZ Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates
[4] Aalborg Univ, Dept Architecture Design & Media Technol, Rendsburggade 14, DK-9000 Aalborg, Denmark
[5] Milestone Syst, Banemarksvej 50C, DK-2605 Brondby, Denmark
[6] Geopipe Inc, 460W 51st, New York, NY 10019 USA
[7] Linkoping Univ, S-58183 Linkoping, Sweden
[8] Univ Cent Florida, Ctr Res Comp Vis CRCV, Orlando, FL 32816 USA
Keywords
Anomaly detection; Self-supervised learning; Multi-task learning; Neural networks; Transformers; Adversarial network; Localization
DOI
10.1016/j.cviu.2023.103656
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in the literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g., based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g., objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.
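The abstract mentions locating high-motion regions as an alternative to a pre-trained object detector. The sketch below illustrates that general idea with plain frame differencing, a much simpler stand-in for the optical flow and background subtraction methods the paper studies; it is not the authors' implementation. The function name `motion_bounding_box`, the `Frame` type, and the `threshold` value are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# A grayscale frame as rows of pixel intensities (0-255). Hypothetical type alias.
Frame = List[List[int]]


def motion_bounding_box(
    prev: Frame, curr: Frame, threshold: int = 25
) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) enclosing every pixel whose intensity
    changed by more than `threshold` between two frames, or None if no pixel did.

    Assumption: simple frame differencing, not the paper's detection pipeline.
    """
    # Collect coordinates of pixels with a large absolute intensity change.
    moved = [
        (r, c)
        for r, (prev_row, curr_row) in enumerate(zip(prev, curr))
        for c, (p, q) in enumerate(zip(prev_row, curr_row))
        if abs(p - q) > threshold
    ]
    if not moved:
        return None
    rows = [r for r, _ in moved]
    cols = [c for _, c in moved]
    # Inclusive bounding box around all changed pixels.
    return min(rows), min(cols), max(rows), max(cols)


# Usage: a 4x5 static background, then a bright object appears in two pixels.
background = [[0] * 5 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][2] = 200
frame[2][3] = 200
box = motion_bounding_box(background, frame)  # (1, 2, 2, 3)
```

A region found this way does not depend on object class, which is the property the abstract highlights when it notes that a class-specific detector misses objects from unknown classes.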
Pages: 13