Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1
Authors
Tan, Chaolei [1]
Lai, Jianhuang [1, 2, 3]
Zheng, Wei-Shi [1, 2, 3]
Hu, Jian-Fang [1, 2, 3]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China
[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01288
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
Pages: 13569-13580
Page count: 12
Related papers
50 in total
  • [1] Weakly-Supervised Alignment of Video With Text
    Bojanowski, P.
    Lajugie, R.
    Grave, E.
    Bach, F.
    Laptev, I.
    Ponce, J.
    Schmid, C.
    2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4462 - 4470
  • [2] Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
    Wang, Ye
    Lin, Wang
    Zhang, Shengyu
    Jin, Tao
    Li, Linjun
    Cheng, Xize
    Zhao, Zhou
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 10914 - 10932
  • [3] Weakly-Supervised Video Object Grounding via Stable Context Learning
    Wang, Wei
    Gao, Junyu
    Xu, Changsheng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 760 - 768
  • [4] WINNER: Weakly-supervised hIerarchical decompositioN and aligNment for spatio-tEmporal video gRounding
    Li, Mengze
    Wang, Han
    Zhang, Wenqiao
    Miao, Jiaxu
    Zhao, Zhou
    Zhang, Shengyu
    Ji, Wei
    Wu, Fei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23090 - 23099
  • [5] Iterative Proposal Refinement for Weakly-Supervised Video Grounding
    School of Electronic and Computer Engineering, Peking University, China
    Unknown
    Proc IEEE Comput Soc Conf Comput Vision Pattern Recognit, : 6524 - 6534
  • [6] MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
    Wang, Qinxin
    Tan, Hao
    Shen, Sheng
    Mahoney, Michael W.
    Yao, Zhewei
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2030 - 2038
  • [7] Weakly-Supervised Spatio-Temporal Video Grounding with Variational Cross-Modal Alignment
    Jin, Yang
    Mu, Yadong
    COMPUTER VISION - ECCV 2024, PT XLVIII, 2025, 15106 : 412 - 429
  • [8] Inverse Compositional Learning for Weakly-supervised Relation Grounding
    Li, Huan
    Wei, Ping
    Ma, Zeyu
    Zheng, Nanning
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15431 - 15441
  • [9] Weakly-Supervised Video Object Grounding via Causal Intervention
    Wang, Wei
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (03) : 3933 - 3948
  • [10] Weakly-Supervised Video Object Grounding via Learning Uni-Modal Associations
    Wang, Wei
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 6329 - 6340