Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Times cited: 0
Authors
Cui, Chenhao [1]
Liang, Xinnian [1]
Wu, Shuangzhi [2]
Li, Zhoujun [1]
Affiliations
[1] Beihang University, Beijing, People's Republic of China
[2] ByteDance, Beijing, People's Republic of China
Source
2023 International Joint Conference on Neural Networks (IJCNN) | 2023
Keywords
DOI
10.1109/IJCNN54540.2023.10191104
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. The task requires not only understanding video and text but also aligning the semantics of the two modalities. Existing methods obtain vision-language representations via separate encoders; their cross-modal interactions are not fine-grained enough, and the semantics are not fully aligned. In this paper, we address vision-language alignment through joint modeling and contrastive learning. We propose a unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns representations of both and captures the inter-relations between them. A predictor then uses these representations to locate the grounded video span. In addition, we train the model with contrastive learning to enhance the vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and that adopting unified representations and contrastive learning can improve vision-language semantic alignment.
Pages: 8