Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1

Authors:

Tan, Chaolei ^{[1]}

Lai, Jianhuang ^{[1,2,3]}

Zheng, Wei-Shi ^{[1,2,3]}

Hu, Jian-Fang ^{[1,2,3]}

Affiliations:

[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China

[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China

[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01288

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
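The abstract does not say how the pseudo video used by the Augmentation Branch is built. As a rough illustration only, the Python sketch below assumes one possible construction: the frames of a video are spliced into unrelated background frames, so the position of the inserted span yields boundary targets without any human timestamp labels. The function name `make_pseudo_video`, the parameter `insert_at`, and the list-of-feature-vectors representation are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one per-frame feature vector (hypothetical representation)


def make_pseudo_video(
    clip: List[Frame],
    background: List[Frame],
    insert_at: int,
) -> Tuple[List[Frame], Tuple[float, float]]:
    """Splice `clip` into `background` and return the pseudo video together
    with the normalized (start, end) span that the clip occupies in it.

    ASSUMPTION: this splice-based construction is an illustrative guess,
    not the procedure described by the authors.
    """
    if not clip:
        raise ValueError("clip must contain at least one frame")
    if not 0 <= insert_at <= len(background):
        raise ValueError("insert_at must lie within the background")

    # Background frames before the clip, the clip itself, then the rest.
    pseudo = background[:insert_at] + clip + background[insert_at:]

    # The boundaries are known by construction, so they can act as
    # regression targets without manual temporal annotation.
    total = len(pseudo)
    start = insert_at / total
    end = (insert_at + len(clip)) / total
    return pseudo, (start, end)


if __name__ == "__main__":
    clip = [[1.0, 1.0]] * 4        # 4 frames from the video being described
    background = [[0.0, 0.0]] * 6  # 6 frames from an unrelated video
    pseudo, span = make_pseudo_video(clip, background, insert_at=2)
    print(len(pseudo), span)
```

Under this assumption, a regression head would be trained to predict the returned span for the whole paragraph, while the weight-sharing Inference Branch would process the original, unmodified video.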


Pages: 13569-13580

Number of pages: 12

