Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1

Authors:

Tan, Chaolei ^{[1]}

Lai, Jianhuang ^{[1,2,3]}

Zheng, Wei-Shi ^{[1,2,3]}

Hu, Jian-Fang ^{[1,2,3]}

Affiliations:

[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China

[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China

[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

DOI:

10.1109/CVPR52733.2024.01288

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.

Pages: 13569-13580

Page count: 12
