CLIP-TSA: CLIP-ASSISTED TEMPORAL SELF-ATTENTION FOR WEAKLY-SUPERVISED VIDEO ANOMALY DETECTION

Cited by: 27

Authors:

Joo, Hyekang Kevin ^{[1]}

Khoa Vo ^{[2]}

Yamazaki, Kashu ^{[2]}

Ngan Le ^{[2]}

Affiliations:

[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA

[2] Univ Arkansas, Dept Comp Sci & Comp Engn, Fayetteville, AR USA

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023

Funding:

U.S. National Science Foundation;

Keywords:

video anomaly detection; temporal self-attention; weakly supervised; multimodal model; subtlety;

DOI:

10.1109/ICIP49359.2023.10222289

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405;

Abstract:

Video anomaly detection (VAD) - commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature - is a challenging problem in video surveillance where the frames of anomaly need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in the novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and ViT feature. The extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly-used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.

References

Pages: 3230-3234

Page count: 5

51 entries in total

[31] Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning [J].

Tian, Yu ;

Pang, Guansong ;

Chen, Yuanhong ;

Singh, Rajvinder ;

Verjans, Johan W. ;

Carneiro, Gustavo .

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :4955-4966

[32] AGENT-ENVIRONMENT NETWORK FOR TEMPORAL ACTION PROPOSAL GENERATION [J].

Viet-Khoa Vo-Ho ;

Le, Ngan ;

Yamazaki, Kashu ;

Sugimoto, Akihiro ;

Minh-Triet Tran .

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :2160-2164

[33]

Vo K., 2021, BMVC

[34] ABN: Agent-Aware Boundary Networks for Temporal Action Proposal Generation [J].

Vo, Khoa ;

Yamazaki, Kashu ;

Truong, Sang ;

Tran, Minh-Triet ;

Sugimoto, Akihiro ;

Le, Ngan .

IEEE ACCESS, 2021, 9 :126431-126445

[35] (2+1)D Distilled ShuffleNet: A Lightweight Unsupervised Distillation Network for Human Action Recognition [J].

Vu, Duc-Quang ;

Le, Ngan T. H. ;

Wang, Jia-Ching .

2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :3197-3203

[36] WEAKLY SUPERVISED VIDEO ANOMALY DETECTION VIA CENTER-GUIDED DISCRIMINATIVE LEARNING [J].

Wan, Boyang ;

Fang, Yuming ;

Xia, Xue ;

Mei, Jiajie .

2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,

[37] GODS: Generalized One-class Discriminative Subspaces for Anomaly Detection [J].

Wang, Jue ;

Cherian, Anoop .

2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :8200-8210

[38] Event-Centric Hierarchical Representation for Dense Video Captioning [J].

Wang, Teng ;

Zheng, Huicheng ;

Yu, Mingjing ;

Tian, Qian ;

Hu, Haifeng .

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (05) :1890-1900

[39]

Wang XP, 2018, IDEAS HIST MOD CHINA, V19, P1, DOI 10.1163/9789004385580_002

[40] Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision [J].

Wu, Peng ;

Liu, Jing ;

Shi, Yujia ;

Sun, Yujia ;

Shao, Fangtao ;

Wu, Zhaoyang ;

Yang, Zhiwei .

COMPUTER VISION - ECCV 2020, PT XXX, 2020, 12375 :322-339
