Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1

Authors:

Tan, Chaolei [1]

Lai, Jianhuang [1,2,3]

Zheng, Wei-Shi [1,2,3]

Hu, Jian-Fang [1,2,3]

Affiliations:

[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China

[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China

[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

DOI:

10.1109/CVPR52733.2024.01288

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.
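The abstract says the Augmentation Branch regresses the boundaries of a complete paragraph within a pseudo video, without timestamp labels. The sketch below illustrates one way such a regression target could arise for free. It assumes, and the abstract does not confirm this, that a pseudo video is formed by inserting a whole video into a sequence of unrelated background clips. The function `make_pseudo_video` and all variable names are hypothetical and are not taken from the paper.

```python
import random


def make_pseudo_video(video_clips, background_clips, rng):
    """Insert a full video at a random offset inside unrelated background clips.

    Assumption (not from the abstract): because the whole video is inserted,
    the paragraph describing it spans exactly the inserted segment, so its
    normalized (start, end) boundary is known by construction.
    """
    # Pick an insertion point anywhere from the beginning to the end.
    offset = rng.randint(0, len(background_clips))
    pseudo = background_clips[:offset] + video_clips + background_clips[offset:]
    total = len(pseudo)
    start = offset / total
    end = (offset + len(video_clips)) / total
    return pseudo, (start, end)


rng = random.Random(0)
video = [f"v{i}" for i in range(4)]        # clips of the annotated video
background = [f"b{i}" for i in range(6)]   # clips from an unrelated video
pseudo, (start, end) = make_pseudo_video(video, background, rng)
```

Under this assumption, the pair `(start, end)` could serve as a regression target for the paragraph as a whole, while sentence-level localization in the original video would still have to come from the alignment learned by the other branch.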

Pages: 13569-13580

Page count: 12
