Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization

Citations: 37
Authors
Li, Jingjing [1]
Yang, Tianyu [2]
Ji, Wei [1]
Wang, Jue [2]
Cheng, Li [1]
Affiliations
[1] University of Alberta, Edmonton, AB, Canada
[2] Tencent AI Lab, Shenzhen, People's Republic of China
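Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
DOI
10.1109/CVPR52688.2022.01929
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract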
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level labels. Most existing methods address this problem with a "localization-by-classification" pipeline that localizes action regions based on snippet-wise classification sequences. Snippet-wise classifications are unfortunately error-prone due to the sparsity of video-level labels. Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting. This is enabled by three key designs: 1) an effective pseudo-label denoising module to alleviate the side effects caused by noisy contrastive features, 2) an efficient region-level feature contrast strategy with a region-level memory bank to capture "global" contrast across the entire dataset, and 3) a diverse contrastive learning strategy to enable action-background separation as well as intra-class compactness and inter-class separability. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the superior performance of our approach.
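The region-level contrast against a memory bank mentioned in the abstract can be illustrated with a generic InfoNCE-style loss. The sketch below is not the authors' implementation: the class `RegionMemoryBank`, the function `info_nce`, the temperature value, the per-class capacity, and the toy 2-D features are all assumptions chosen for illustration, and the pseudo-label denoising step is not modeled.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


class RegionMemoryBank:
    """Hypothetical per-class FIFO store of L2-normalized region features."""

    def __init__(self, max_per_class=64):
        self.max_per_class = max_per_class
        self.store = {}

    def add(self, label, feature):
        queue = self.store.setdefault(label, deque(maxlen=self.max_per_class))
        queue.append(l2_normalize(feature))

    def positives(self, label):
        # Stored regions that share the query's (pseudo-)label.
        return list(self.store.get(label, []))

    def negatives(self, label):
        # Stored regions from every other class, background included.
        return [f for k, feats in self.store.items() if k != label for f in feats]


def info_nce(query, positives, negatives, temperature=0.1):
    """Mean InfoNCE loss of one query over its positives.

    Keys are assumed to be L2-normalized already (the bank does this on add).
    """
    if not positives:
        raise ValueError("at least one positive is required")
    q = l2_normalize(query)
    neg_logits = [dot(q, n) / temperature for n in negatives]
    losses = []
    for p in positives:
        pos_logit = dot(q, p) / temperature
        logits = [pos_logit] + neg_logits
        m = max(logits)  # log-sum-exp trick for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - pos_logit)
    return sum(losses) / len(losses)


# Toy usage with made-up 2-D region features gathered from different videos.
bank = RegionMemoryBank()
bank.add("dive", [1.0, 0.1])
bank.add("dive", [0.9, 0.2])
bank.add("background", [0.0, 1.0])
bank.add("jump", [-1.0, 0.2])

# A query close to the "dive" regions gives a small loss ...
good = info_nce([1.0, 0.0], bank.positives("dive"), bank.negatives("dive"))
# ... while a query that looks like background but is labeled "dive" gives a large one.
bad = info_nce([0.0, 1.0], bank.positives("dive"), bank.negatives("dive"))
```

Because the bank holds regions from many videos, the negatives in the denominator come from across the dataset rather than from a single video, which is the sense of "global" contrast used in the abstract. Treating background as one more key in the bank is a simplification made here for brevity.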
Pages: 19882-19892
Page count: 11