Learning Proposal-Aware Re-Ranking for Weakly-Supervised Temporal Action Localization

Cited by: 15
Authors
Hu, Yufan [1 ,2 ]
Fu, Jie [3 ]
Chen, Mengyuan [3 ]
Gao, Junyu [3 ]
Dong, Jianfeng [4 ]
Fan, Bin [1 ,2 ]
Liu, Hongmin [1 ,2 ]
Affiliations
[1] Univ Sci & Technol Beijing, Key Lab Intelligent Bion Unmanned Syst, Minist Educ, Sch Intelligence Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[3] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 100190, Peoples R China
[4] Zhejiang Gongshang Univ, Coll Comp & Informat Engn, Hangzhou 310018, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Proposals; Feature extraction; Location awareness; Videos; Measurement; Task analysis; Optimization; weakly-supervised temporal action localization; Proposal-aware reranking; NETWORK; VIDEO; RETRIEVAL; ATTENTION;
DOI
10.1109/TCSVT.2023.3283430
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
Weakly-supervised temporal action localization (WTAL) aims to localize and classify action instances in untrimmed videos when only video-level labels are available. Despite the remarkable success of existing methods, the proposals they generate commonly far outnumber the ground-truth action instances, so improving the ranking accuracy of the generated proposals remains worthwhile: users in real-world scenarios usually prioritize the action proposals with the highest confidence scores. The inaccuracy of proposal ranking mainly stems from two aspects. First, the traditional proposal generation scheme relies entirely on snippet-level perception, leaving a significant yet largely unnoticed gap with the target of proposal-level localization. Second, existing methods commonly adopt a hand-crafted proposal generation procedure, a post-processing step that does not participate in model optimization. To address these issues, we propose an end-to-end trained two-stage method, termed Learning Proposal-aware Re-ranking (LPR), for WTAL. In the first stage, a proposal-aware feature learning module injects proposal-aware contextual information into each snippet, and the enhanced features are then used to predict initial proposals. In the second stage, to perform effective and efficient proposal re-ranking, we contrast the high-confidence proposals with constructed multi-scale foreground/background prototypes for further optimization. Evaluated with both the vanilla and Top-$k$ mAP metrics, extensive experiments on two popular benchmarks demonstrate the effectiveness of the proposed method.
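To make the two-stage description above concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: the module names (ProposalAwareFeatureLearning, rerank_with_prototypes), the use of self-attention to inject snippet context, the cosine-similarity matching against foreground/background prototypes, and all shapes and hyper-parameters are illustrative assumptions.

# Minimal sketch of a proposal-aware two-stage pipeline (assumed form, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalAwareFeatureLearning(nn.Module):
    """Stage 1 (assumed form): enrich each snippet with contextual information
    via self-attention, then produce a class activation sequence for proposals."""

    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)  # +1 for background

    def forward(self, snippets):                  # snippets: (B, T, D)
        context, _ = self.attn(snippets, snippets, snippets)
        enhanced = snippets + context             # proposal-aware snippet features
        cas = self.cls_head(enhanced)             # class activation sequence (B, T, C+1)
        return enhanced, cas


def rerank_with_prototypes(proposal_feats, fg_protos, bg_protos, temperature=0.1):
    """Stage 2 (assumed form): re-score high-confidence proposals by contrasting
    them with multi-scale foreground/background prototypes (cosine similarity).

    proposal_feats: (N, D) pooled features of high-confidence proposals
    fg_protos, bg_protos: (S, D) prototypes at S temporal scales
    Returns an (N,) score in [0, 1]; higher means more likely a true action.
    """
    p = F.normalize(proposal_feats, dim=-1)
    fg = F.normalize(fg_protos, dim=-1)
    bg = F.normalize(bg_protos, dim=-1)
    fg_sim = (p @ fg.t()).max(dim=-1).values      # best match over scales
    bg_sim = (p @ bg.t()).max(dim=-1).values
    logits = torch.stack([fg_sim, bg_sim], dim=-1) / temperature
    return logits.softmax(dim=-1)[:, 0]           # foreground probability


if __name__ == "__main__":
    feats = torch.randn(2, 100, 2048)             # 2 videos, 100 snippets each
    enhanced, cas = ProposalAwareFeatureLearning()(feats)
    scores = rerank_with_prototypes(torch.randn(8, 2048),
                                    torch.randn(4, 2048), torch.randn(4, 2048))
    print(cas.shape, scores.shape)                # (2, 100, 21) and (8,)

In this sketch the re-ranking score would simply replace or re-weight the initial confidence of each proposal before evaluation; how the prototypes are built and how the two stages are trained end-to-end are described in the paper itself.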
Pages: 207-220
Page count: 14