Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1
Authors
Tan, Chaolei [1]
Lai, Jianhuang [1, 2, 3]
Zheng, Wei-Shi [1, 2, 3]
Hu, Jian-Fang [1, 2, 3]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China
[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China
[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01288
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
Pages: 13569-13580
Page count: 12