Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1
Authors
Tan, Chaolei [1]
Lai, Jianhuang [1, 2, 3]
Zheng, Wei-Shi [1, 2, 3]
Hu, Jian-Fang [1, 2, 3]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China
[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01288
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
Pages: 13569-13580
Number of pages: 12