Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

Cited by: 1

Authors:

Tan, Chaolei ^{[1]}

Lai, Jianhuang ^{[1,2,3]}

Zheng, Wei-Shi ^{[1,2,3]}

Hu, Jian-Fang ^{[1,2,3]}

Affiliations:

[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Guangdong, Peoples R China

[2] Guangdong Prov Key Lab Informat Secur Technol, Guangzhou, Guangdong, Peoples R China

[3] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01288

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video Paragraph Grounding (VPG) is an emerging task in video-language understanding, which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However, existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work, we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking, we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video, and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning, outperforming state-of-the-art methods trained with the same or stronger supervision.

Pages: 13569-13580

Page count: 12

50 items in total

[21] VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval [J].

Ma, Minuk; Yoon, Sunjae; Kim, Junyeong; Lee, Youngjoon; Kang, Sunghun; Yoo, Chang D.

COMPUTER VISION - ECCV 2020, PT XXVIII, 2020, 12373: 156-171

[22] Triadic temporal-semantic alignment for weakly-supervised video moment retrieval [J].

Liu, Jin; Xie, Jialong; Zhou, Fengyu; He, Shengfeng.

PATTERN RECOGNITION, 2024, 156

[23] Weakly-supervised Visual Grounding of Phrases with Linguistic Structures [J].

Xiao, Fanyi; Sigal, Leonid; Lee, Yong Jae.

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 5253-5262

[24] Bi-calibration Networks for Weakly-Supervised Video Representation Learning [J].

Long, Fuchen; Yao, Ting; Qiu, Zhaofan; Tian, Xinmei; Luo, Jiebo; Mei, Tao.

INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 131 (07): 1704-1721

[26] Not All Frames Are Equal: Weakly-Supervised Video Grounding with Contextual Similarity and Visual Clustering Losses [J].

Shi, Jing; Xu, Jia; Gong, Boqing; Xu, Chenliang.

2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019: 10436-10444

[27] Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding [J].

Bao, Peijun; Xia, Yong; Yang, Wenhan; Ng, Boon Poh; Er, Meng Hwa; Kot, Alex C.

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024: 738-746

[28] Weakly-Supervised RGBD Video Object Segmentation [J].

Yang, Jinyu; Gao, Mingqi; Zheng, Feng; Zhen, Xiantong; Ji, Rongrong; Shao, Ling; Leonardis, Ales.

IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 2158-2170

[29] Semi-supervised Video Paragraph Grounding with Contrastive Encoder [J].

Jiang, Xun; Xu, Xing; Zhang, Jingran; Shen, Fumin; Cao, Zuo; Shen, Heng Tao.

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022: 2456-2465

[30] Weakly-supervised learning of visual relations [J].

Peyre, Julia; Laptev, Ivan; Schmid, Cordelia; Sivic, Josef.

2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 5189-5198
