Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1
Authors
Tan, Chaolei [1]
Lai, Jianhuang [1, 2, 3]
Zheng, Wei-Shi [1, 2, 3]
Hu, Jian-Fang [1, 2, 3]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China
[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01288
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
Pages: 13569-13580
Number of pages: 12