Learning Proposal-Aware Re-Ranking for Weakly-Supervised Temporal Action Localization

Cited by: 15
Authors
Hu, Yufan [1 ,2 ]
Fu, Jie [3 ]
Chen, Mengyuan [3 ]
Gao, Junyu [3 ]
Dong, Jianfeng [4 ]
Fan, Bin [1 ,2 ]
Liu, Hongmin [1 ,2 ]
Affiliations
[1] Univ Sci & Technol Beijing, Key Lab Intelligent Bion Unmanned Syst, Minist Educ, Sch Intelligence Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[3] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 100190, Peoples R China
[4] Zhejiang Gongshang Univ, Coll Comp & Informat Engn, Hangzhou 310018, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Proposals; Feature extraction; Location awareness; Videos; Measurement; Task analysis; Optimization; weakly-supervised temporal action localization; Proposal-aware reranking; NETWORK; VIDEO; RETRIEVAL; ATTENTION;
DOI
10.1109/TCSVT.2023.3283430
CLC Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
Weakly-supervised temporal action localization (WTAL) aims to localize and classify action instances in untrimmed videos when only video-level labels are available. Despite the remarkable success of existing methods, the proposals they generate commonly far outnumber the ground-truth action instances, so improving the ranking accuracy of the generated proposals remains worthwhile: users in real-world scenarios usually prioritize the action proposals with the highest confidence scores. The inaccuracy of proposal ranking mainly stems from two aspects. First, the traditional proposal generation scheme relies entirely on snippet-level perception, leaving a significant yet largely unnoticed gap with the target of proposal-level localization. Second, existing methods commonly adopt a hand-crafted proposal generation procedure, a post-processing step that does not participate in model optimization. To address these issues, we propose an end-to-end trained two-stage method, termed Learning Proposal-aware Re-ranking (LPR), for WTAL. In the first stage, a proposal-aware feature learning module injects proposal-aware contextual information into each snippet, and the enhanced features are then used to predict initial proposals. In the second stage, to perform effective and efficient proposal re-ranking, we contrast the high-confidence proposals with constructed multi-scale foreground/background prototypes for further optimization. Evaluated with both the vanilla and Top-$k$ mAP metrics, extensive experiments on two popular benchmarks demonstrate the effectiveness of the proposed method.
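To make the two-stage description above concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: the module names (ProposalAwareFeatureLearning, rerank_with_prototypes), the use of self-attention to inject snippet context, the cosine-similarity matching against foreground/background prototypes, and all shapes and hyper-parameters are illustrative assumptions.

# Minimal sketch of a proposal-aware two-stage pipeline (assumed form, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalAwareFeatureLearning(nn.Module):
    """Stage 1 (assumed form): enrich each snippet with contextual information
    via self-attention, then produce a class activation sequence for proposals."""

    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)  # +1 for background

    def forward(self, snippets):                  # snippets: (B, T, D)
        context, _ = self.attn(snippets, snippets, snippets)
        enhanced = snippets + context             # proposal-aware snippet features
        cas = self.cls_head(enhanced)             # class activation sequence (B, T, C+1)
        return enhanced, cas


def rerank_with_prototypes(proposal_feats, fg_protos, bg_protos, temperature=0.1):
    """Stage 2 (assumed form): re-score high-confidence proposals by contrasting
    them with multi-scale foreground/background prototypes (cosine similarity).

    proposal_feats: (N, D) pooled features of high-confidence proposals
    fg_protos, bg_protos: (S, D) prototypes at S temporal scales
    Returns an (N,) score in [0, 1]; higher means more likely a true action.
    """
    p = F.normalize(proposal_feats, dim=-1)
    fg = F.normalize(fg_protos, dim=-1)
    bg = F.normalize(bg_protos, dim=-1)
    fg_sim = (p @ fg.t()).max(dim=-1).values      # best match over scales
    bg_sim = (p @ bg.t()).max(dim=-1).values
    logits = torch.stack([fg_sim, bg_sim], dim=-1) / temperature
    return logits.softmax(dim=-1)[:, 0]           # foreground probability


if __name__ == "__main__":
    feats = torch.randn(2, 100, 2048)             # 2 videos, 100 snippets each
    enhanced, cas = ProposalAwareFeatureLearning()(feats)
    scores = rerank_with_prototypes(torch.randn(8, 2048),
                                    torch.randn(4, 2048), torch.randn(4, 2048))
    print(cas.shape, scores.shape)                # (2, 100, 21) and (8,)

In this sketch the re-ranking score would simply replace or re-weight the initial confidence of each proposal before evaluation; how the prototypes are built and how the two stages are trained end-to-end are described in the paper itself.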
Pages: 207-220
Page count: 14