Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization

Times cited: 37

Authors:

Li, Jingjing [1]

Yang, Tianyu [2]

Ji, Wei [1]

Wang, Jue [2]

Cheng, Li [1]

Affiliations:

[1] Univ Alberta, Edmonton, AB, Canada

[2] Tencent AI Lab, Shenzhen, Peoples R China

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Funding:

Natural Sciences and Engineering Research Council of Canada (NSERC);

DOI:

10.1109/CVPR52688.2022.01929

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level labels. Most existing methods address this problem with a "localization-by-classification" pipeline that localizes action regions based on snippet-wise classification sequences. Unfortunately, snippet-wise classifications are error-prone due to the sparsity of video-level labels. Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm that aims to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting. This is enabled by three key designs: 1) an effective pseudo-label denoising module to alleviate the side effects caused by noisy contrastive features, 2) an efficient region-level feature contrast strategy with a region-level memory bank to capture "global" contrast across the entire dataset, and 3) a diverse contrastive learning strategy to enable action-background separation as well as intra-class compactness and inter-class separability. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the superior performance of our approach.
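To make the idea of region-level contrast against a dataset-wide memory bank more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a per-class FIFO bank of L2-normalized region features (with one extra slot for background regions), an InfoNCE-style loss in which stored regions of the query's class are positives and all other slots are negatives, and a simple confidence threshold as a crude stand-in for the paper's pseudo-label denoising module. All names (`RegionMemoryBank`, `push`, `contrastive_loss`, `min_confidence`, `temperature`) are hypothetical.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class RegionMemoryBank:
    """Hypothetical per-class FIFO bank of region-level features.

    Slots 0..K-1 hold action classes; an extra slot can hold background
    regions, so action-background pairs also act as negatives.
    """

    def __init__(self, num_slots, size_per_slot, min_confidence=0.5):
        self.banks = [deque(maxlen=size_per_slot) for _ in range(num_slots)]
        self.min_confidence = min_confidence

    def push(self, slot, feature, confidence):
        # Assumed stand-in for pseudo-label denoising (not the paper's module):
        # regions whose pseudo-label confidence is too low are not stored.
        if confidence >= self.min_confidence:
            self.banks[slot].append(l2_normalize(feature))

    def contrastive_loss(self, query, slot, temperature=0.1):
        """InfoNCE loss averaged over positives: entries in the query's slot are
        positives; entries in every other slot (other classes, background)
        are negatives."""
        positives = list(self.banks[slot])
        if not positives:
            return 0.0
        q = l2_normalize(query)
        neg_logits = [dot(q, f) / temperature
                      for s, bank in enumerate(self.banks) if s != slot
                      for f in bank]
        losses = []
        for p in positives:
            pos_logit = dot(q, p) / temperature
            logits = [pos_logit] + neg_logits
            m = max(logits)  # subtract the max for a numerically stable log-sum-exp
            log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
            losses.append(log_denom - pos_logit)
        return sum(losses) / len(losses)


# Usage: 2 action classes plus 1 background slot, 2-D toy features.
bank = RegionMemoryBank(num_slots=3, size_per_slot=4)
bank.push(0, [1.0, 0.0], confidence=0.9)
bank.push(1, [0.0, 1.0], confidence=0.9)
bank.push(2, [-1.0, 0.0], confidence=0.2)  # dropped: below min_confidence
aligned = bank.contrastive_loss([1.0, 0.0], slot=0)
misaligned = bank.contrastive_loss([0.0, 1.0], slot=0)
```

A query that lies close to the stored regions of its own class yields a lower loss than one that lies close to another class, which is the pull-together/push-apart behavior the abstract attributes to intra-class compactness and inter-class separability. Because the bank accumulates regions pushed from many videos, each query is contrasted against features from across the dataset rather than only its own video.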

Pages: 19882-19892

Page count: 11

Number of references: 69
