Boosting Weakly-Supervised Temporal Action Localization with Text Information

Cited by: 25
Authors
Li, Guozhang [1 ]
Cheng, De [1 ]
Ding, Xinpeng [2 ]
Wang, Nannan [1 ]
Wang, Xiaoyu [3 ]
Gao, Xinbo [1 ,4 ]
Affiliations
[1] Xidian University, Xi'an, China
[2] Hong Kong University of Science and Technology, Hong Kong, China
[3] The Chinese University of Hong Kong, Shenzhen, Shenzhen, China
[4] Chongqing University of Posts and Telecommunications, Chongqing, China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.01026
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Due to the lack of temporal annotations, current Weakly-supervised Temporal Action Localization (WTAL) methods generally suffer from over-complete or incomplete localization. In this paper, we leverage text information to boost WTAL from two aspects: (a) a discriminative objective that enlarges the inter-class difference, thus reducing over-complete localization; and (b) a generative objective that enhances intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description from the action class label and uses it as a query to mine all class-related segments. Without temporal annotations of actions, TSM compares the text query against entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones. However, because videos of different categories share sub-actions, applying TSM alone is too strict: it neglects semantically related segments, which results in incomplete localization. We therefore introduce a generative objective named Video-text Language Completion (VLC), which attends to all semantically related segments in a video to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our method can be seamlessly applied to existing methods, improving their performance by a clear margin. The code is available at https://github.com/lgzlIlIlI/Boosting-WTAL.
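To make the two objectives in the abstract concrete, below is a minimal PyTorch sketch of how they could be instantiated. Everything here is an illustrative assumption rather than the authors' implementation: the tensor shapes, the top-k pooling standing in for TSM's segment mining, the temperature tau, and the attention-pooled VLCHead that predicts a masked action word are all placeholders.

import torch
import torch.nn.functional as F

def tsm_loss(seg_feats, text_feats, labels, k=8, tau=0.07):
    # seg_feats:  (B, T, D) segment-level video features
    # text_feats: (C, D)    one text-query embedding per action class
    # labels:     (B, C)    multi-hot video-level class labels
    seg_feats = F.normalize(seg_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Similarity of every segment to every class text query: (B, T, C)
    sim = torch.einsum('btd,cd->btc', seg_feats, text_feats)
    # Mine the k best-matching segments per class and average them,
    # ignoring irrelevant segments (top-k pooling as a stand-in for
    # the paper's mining step; assumes T >= k).
    video_score = sim.topk(k, dim=1).values.mean(dim=1)  # (B, C)
    # Video-level classification against the weak labels.
    return F.binary_cross_entropy_with_logits(video_score / tau, labels.float())

class VLCHead(torch.nn.Module):
    # Toy completion head: attention-pool the segments using the masked
    # description as the query, then predict the masked action word.
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.classifier = torch.nn.Linear(dim, vocab_size)

    def forward(self, seg_feats, masked_query, target_ids):
        # masked_query: (B, D) embedding of the description with the action
        # word masked out; target_ids: (B,) vocabulary ids of the masked word.
        attn = torch.softmax(torch.einsum('btd,bd->bt', seg_feats, masked_query), dim=1)
        pooled = torch.einsum('bt,btd->bd', attn, seg_feats)  # (B, D)
        return F.cross_entropy(self.classifier(pooled), target_ids)

The intent of the pairing: tsm_loss rewards only the segments that best match a class's text query, sharpening inter-class differences, while the completion head can only recover the masked word when attention spreads over every semantically related segment, which encourages intra-class integrity.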
Pages: 10648-10657
Number of pages: 10