Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Cited by: 0

Authors:

Cui, Chenhao [1]

Liang, Xinnian [1]

Wu, Shuangzhi [2]

Li, Zhoujun [1]

Affiliations:

[1] Beihang University, Beijing, People's Republic of China

[2] ByteDance, Beijing, People's Republic of China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Keywords:

DOI:

10.1109/IJCNN54540.2023.10191104

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. This task requires not only understanding video and text but also aligning the semantics of the two modalities. In existing methods, vision-language representations are obtained via separate encoders, cross-modal interactions are not fine-grained enough, and the semantics are not fully aligned. In this paper, we address vision-language alignment via joint modeling and contrastive learning. We propose a Unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns the representations of both and captures the inter-relations between them. A predictor then uses these representations to locate the grounded video span. In addition, we train the model with contrastive learning to enhance the vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and that adopting unified representations and contrastive learning can improve vision-language semantic alignment.
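The abstract does not spell out the contrastive objective. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over paired video and text embeddings, a common choice for vision-language alignment; the function names, the temperature value, and the pairing scheme are all assumptions and are not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # Assumes a non-zero vector; a zero vector would divide by zero.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax_at(scores, idx):
    # Numerically stable log-softmax evaluated at a single index.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_sum

def info_nce(video_embs, text_embs, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss.

    The i-th video and the i-th text are treated as the positive pair;
    every other pairing in the batch serves as a negative.
    """
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video direction: each column should peak on its diagonal entry.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)

# Toy embeddings: matched pairs versus deliberately mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each video embedding points in the same direction as its paired text embedding, and much larger when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to reward.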

Pages: 8