Enriching Local and Global Contexts for Temporal Action Localization

Cited by: 76

Authors:

Zhu, Zixin [1]

Tang, Wei [2]

Wang, Le [1]

Zheng, Nanning [1]

Hua, Gang [3]

Affiliations:

[1] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian, Peoples R China

[2] Univ Illinois, Chicago, IL USA

[3] Wormpex AI Res, Bellevue, WA 98004 USA

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Key Research and Development Program;

Keywords:

ACTION RECOGNITION;

DOI:

10.1109/ICCV48922.2021.01326

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Effectively tackling the problem of temporal action localization (TAL) necessitates a visual representation that jointly pursues two confounding goals, i.e., fine-grained discrimination for temporal localization and sufficient visual invariance for action classification. We address this challenge by enriching both the local and global contexts in the popular two-stage temporal localization framework, where action proposals are first generated, followed by action classification and temporal boundary regression. Our proposed model, dubbed ContextLoc, can be divided into three subnetworks: L-Net, G-Net and P-Net. L-Net enriches the local context via fine-grained modeling of snippet-level features, which is formulated as a query-and-retrieval process. G-Net enriches the global context via higher-level modeling of the video-level representation. In addition, we introduce a novel context adaptation module to adapt the global context to different proposals. P-Net further models the context-aware inter-proposal relations. We explore two existing models as the P-Net in our experiments. The efficacy of our proposed method is validated by experimental results on the THUMOS14 (54.3% at tIoU@0.5) and ActivityNet v1.3 (56.01% at tIoU@0.5) datasets, which outperform the recent state of the art. Code is available at https://github.com/buxiangzhiren/ContextLoc.
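The results in the abstract are reported at a temporal Intersection-over-Union (tIoU) threshold of 0.5. As a minimal sketch of what that threshold means, the snippet below computes tIoU between two time segments and checks it against a threshold. The function names, the `(start, end)` tuple representation in seconds, and the example segments are illustrative assumptions and are not taken from the ContextLoc code; a full evaluation would also require matching action labels and averaging precision over classes.

```python
def temporal_iou(segment_a, segment_b):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    start_a, end_a = segment_a
    start_b, end_b = segment_b
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


def meets_tiou_threshold(predicted, ground_truth, threshold=0.5):
    """True if the predicted segment overlaps the ground truth enough.

    Hypothetical helper: it only checks temporal overlap and ignores
    the class-label match that a complete evaluation would also need.
    """
    return temporal_iou(predicted, ground_truth) >= threshold


if __name__ == "__main__":
    ground_truth = (2.0, 10.0)   # made-up annotated action segment
    prediction = (0.0, 10.0)     # made-up predicted segment
    print(temporal_iou(prediction, ground_truth))
    print(meets_tiou_threshold(prediction, ground_truth))
```

In this made-up case the segments share 8 seconds out of a 10-second union, so the prediction would count as a match at the 0.5 threshold.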

Citation

Pages: 13496-13505

Page count: 10
