Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Cited by: 0

Authors:

Cui, Chenhao [1]

Liang, Xinnian [1]

Wu, Shuangzhi [2]

Li, Zhoujun [1]

Affiliations:

[1] Beihang University, Beijing, People's Republic of China

[2] ByteDance, Beijing, People's Republic of China

Source:

2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023

Keywords:

DOI:

10.1109/IJCNN54540.2023.10191104

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders, so their cross-modal interactions are not fine-grained enough and the semantics are not fully aligned. In this paper, we address vision-language alignment via joint modeling and contrastive learning. We propose a Unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns the representations of both and captures the inter-relations between video and text. The predictor then uses these representations to locate the grounding video span. In addition, we train our model with contrastive learning to enhance the vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and that adopting unified representations and contrastive learning can improve vision-language semantic alignment.
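
The abstract does not state which contrastive objective is used. As a rough illustration of the general idea only, the sketch below computes a symmetric InfoNCE loss over a batch of paired video and text embeddings, a common choice for vision-language alignment. The function names, the temperature value, and the use of in-batch negatives are all assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, text) embedding pairs.

    Assumption: row i of video_embs and row i of text_embs form a matched
    pair, and every other row in the batch serves as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)

    # Temperature-scaled cosine similarity between every video and every text.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the logit of the true pair.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    # Video-to-text uses rows of sim; text-to-video uses columns.
    v2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2v = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", info_nce(videos, aligned_texts))
    print("swapped loss:", info_nce(videos, swapped_texts))
```

Under these assumptions, the loss is small when each video embedding is closest to its own query embedding and large when the pairs are mismatched, which is the pressure toward semantic alignment that the abstract attributes to contrastive training.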


Pages: 8

References: 38 in total

