Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Cited by: 116
Authors
Zhai, Yuanhao [1 ]
Wang, Le [1 ]
Tang, Wei [2 ]
Zhang, Qilin [3 ]
Yuan, Junsong [4 ]
Hua, Gang [5 ]
Affiliations
[1] Xi An Jiao Tong Univ, Xian, Shaanxi, Peoples R China
[2] Univ Illinois, Chicago, IL USA
[3] HERE Technol, Chicago, IL USA
[4] SUNY Buffalo, Buffalo, NY USA
[5] Wormpex AI Res, Bellevue, WA USA
Source
COMPUTER VISION - ECCV 2020, PT VI | 2020 / Vol. 12351
Funding
China Postdoctoral Science Foundation; National Key Research and Development Program of China;
Keywords
Temporal action localization; Weakly-supervised learning; Histograms;
DOI
10.1007/978-3-030-58539-6_3
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and to generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries. Experiments on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods and even achieves results comparable to those of some recent fully-supervised methods.
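For illustration, the sketch below (in Python/PyTorch) shows one plausible way to realize the two ideas the abstract describes: a consensus pseudo ground truth fused from the two streams, and an attention normalization term that pushes frame-level attention toward a binary selection. The function names, the top-k/bottom-k formulation, and the fusion-by-averaging step are assumptions made for this sketch, not the paper's exact implementation.

import torch

def attention_normalization_loss(attention, k_ratio=8):
    # Hypothetical realization of the attention normalization loss: push a
    # frame-level attention sequence toward a binary selection by maximizing
    # the gap between the means of its top-k and bottom-k values.
    # attention: tensor of shape (T,), values in [0, 1] for one untrimmed video.
    T = attention.shape[0]
    k = max(T // k_ratio, 1)
    top_k = torch.topk(attention, k, largest=True).values.mean()
    bottom_k = torch.topk(attention, k, largest=False).values.mean()
    return bottom_k - top_k  # minimized when top-k values approach 1 and bottom-k values approach 0

def fuse_pseudo_ground_truth(att_rgb, att_flow, threshold=0.5):
    # Hypothetical fusion step: average the RGB- and flow-stream attentions and
    # threshold the consensus to obtain frame-level pseudo labels that supervise
    # the next refinement iteration.
    consensus = 0.5 * (att_rgb + att_flow)
    return (consensus >= threshold).float()

In such a scheme, each training round would alternate between updating the two streams with the current pseudo labels and regenerating the pseudo labels from their fused attention; the threshold and k_ratio values above are placeholders rather than values taken from the paper.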
Pages: 37-54
Number of pages: 18