Semi-supervised Video Paragraph Grounding with Contrastive Encoder

Cited by: 22
Authors
Jiang, Xun [1 ,2 ]
Xu, Xing [1 ,2 ]
Zhang, Jingran [1 ,2 ]
Shen, Fumin [1 ,2 ]
Cao, Zuo [3 ]
Shen, Heng Tao [1 ,2 ,4 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu, Peoples R China
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu, Peoples R China
[3] MeiTuan, Beijing, Peoples R China
[4] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
National Natural Science Foundation of China;
Keywords
ATTENTION; NETWORK;
DOI
10.1109/CVPR52688.2022.00250
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video event grounding aims to retrieve the moments most relevant to a given natural language query from an untrimmed video. Most previous works focus on Video Sentence Grounding (VSG), which localizes a single moment with a sentence query. Recently, researchers have extended this task to Video Paragraph Grounding (VPG), which retrieves multiple events with a paragraph query. However, we find that existing VPG methods may model context poorly and rely heavily on video-paragraph annotations. To tackle this problem, we propose a novel VPG method termed Semi-supervised Video-Paragraph TRansformer (SVPTR), which exploits contextual information in paragraphs more effectively and significantly reduces the dependency on annotated data. Our SVPTR method consists of two key components: (1) a base model, VPTR, that learns video-paragraph alignment with contrastive encoders and addresses the lack of sentence-level contextual interaction, and (2) a semi-supervised learning framework with multimodal feature perturbation that reduces the amount of annotated training data required. We evaluate our model on three widely used video grounding datasets, i.e., ActivityNet-Caption, Charades-CD-OOD, and TACoS. The experimental results show that our SVPTR method establishes new state-of-the-art performance on all three datasets. Even with fewer annotations, it achieves results competitive with recent VPG methods.
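The abstract names two mechanisms without implementation detail: contrastive encoders for video-paragraph alignment and multimodal feature perturbation for semi-supervised training. As a rough illustration only, the sketch below shows one common way such objectives are written in PyTorch. The function names, the symmetric InfoNCE formulation, and the Gaussian-noise perturbation are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only; not the SVPTR code. Assumes paired (B, D) batch
# embeddings for alignment and a grounding model that maps features to
# span predictions for the consistency term.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired video/paragraph embeddings.

    Row i of each tensor is a matched pair; every other row in the batch
    serves as a negative.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))


def perturbation_consistency_loss(model,
                                  video_feats: torch.Tensor,
                                  text_feats: torch.Tensor,
                                  noise_std: float = 0.1) -> torch.Tensor:
    """Consistency term for unlabelled videos: the prediction on clean
    features is used as a fixed target for the prediction on perturbed ones."""
    with torch.no_grad():
        clean_pred = model(video_feats, text_feats)      # pseudo-target, no gradient
    noisy_pred = model(video_feats + noise_std * torch.randn_like(video_feats),
                       text_feats + noise_std * torch.randn_like(text_feats))
    return F.mse_loss(noisy_pred, clean_pred)
```

In a semi-supervised setup of this kind, labelled videos would contribute a supervised grounding loss plus the alignment term, while unlabelled videos would contribute only the consistency term, typically weighted by a hyperparameter.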
Pages: 2456-2465
Page count: 10