Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1
Authors
Tan, Chaolei [1]
Lai, Jianhuang [1, 2, 3]
Zheng, Wei-Shi [1, 2, 3]
Hu, Jian-Fang [1, 2, 3]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China
[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01288
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
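The abstract does not say how the pseudo video is built. As a minimal sketch, assuming it is formed by splicing a complete video into an unrelated background video, the following shows why such a construction would give the Augmentation Branch boundary targets without any timestamp labels. The function `make_pseudo_video` and the names `target` and `background` are hypothetical and are not taken from the paper.

```python
import random


def make_pseudo_video(target, background, rng):
    """Splice every clip of `target` into `background` at a random offset.

    Assumption for illustration only: the paper's actual construction
    may differ. Returns the pseudo video and the normalized (start, end)
    span that the whole of `target` occupies inside it.
    """
    # randint is inclusive on both ends, so the target may also be
    # placed before the first or after the last background clip.
    offset = rng.randint(0, len(background))
    pseudo = background[:offset] + target + background[offset:]
    total = len(pseudo)
    start = offset / total
    end = (offset + len(target)) / total
    return pseudo, (start, end)


rng = random.Random(0)
target = [f"t{i}" for i in range(4)]      # stand-ins for clip features
background = [f"b{i}" for i in range(6)]  # clips from an unrelated video
pseudo, (start, end) = make_pseudo_video(target, background, rng)
```

Under this assumption the paragraph describing `target` spans exactly `(start, end)` in the pseudo video, so the span can serve as a regression target that is known by construction.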
Pages: 13569-13580
Page count: 12