CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

Cited by: 114
Authors
Zhang, Can [1]
Cao, Meng [1]
Yang, Dongming [1]
Chen, Jie [1, 2]
Zou, Yuexian [1, 2]
Affiliations
[1] Peking Univ, Sch Elect & Comp Engn, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.01575
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Weakly-supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos with only video-level labels. Most existing models follow the "localization by classification" procedure: locate temporal regions contributing most to the video-level classification. Generally, they process each snippet (or frame) individually and thus overlook the fruitful temporal context relation. Here arises the single snippet cheating issue: "hard" snippets are too vague to be classified. In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short. Specifically, we propose a Snippet Contrast (SniCo) Loss to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid the temporal interval interruption. Besides, since it is infeasible to access frame-level annotations, we introduce a Hard Snippet Mining algorithm to locate the potential hard snippets. Substantial analyses verify that this mining strategy efficaciously captures the hard snippets and SniCo Loss leads to more informative feature representation. Extensive experiments show that CoLA achieves state-of-the-art results on THUMOS'14 and ActivityNet v1.2 datasets.
Pages: 16005-16014
Page count: 10