Learning Unified Video-Language Representations via Joint Modeling and Contrastive Learning for Natural Language Video Localization

Cited by: 0
Authors
Cui, Chenhao [1]
Liang, Xinnian [1]
Wu, Shuangzhi [2]
Li, Zhoujun [1]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Bytedance, Beijing, Peoples R China
Source
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN | 2023
DOI
10.1109/IJCNN54540.2023.10191104
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Natural language video localization (NLVL) aims to locate the span of an untrimmed video that matches a given query sentence. This task requires not only understanding video and text but also aligning the semantics between video and language. Existing methods obtain vision-language representations via separate encoders; as a result, cross-modal interactions are not fine-grained enough and the semantics are not fully aligned. In this paper, we address vision-language alignment via joint modeling and contrastive learning. We propose a unified Video-Language Representation Network (UniNet), which employs a transformer encoder to learn aligned vision-language representations. Taking video and text as input simultaneously, the encoder jointly learns the representations of both and captures the inter-relations between video and text. The predictor then uses these representations to locate the grounding video span. In addition, we train our model with contrastive learning to enhance the vision-language representations. Experiments on three benchmark datasets show that UniNet outperforms the baseline methods and that adopting unified representations and contrastive learning can improve vision-language semantic alignment.
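The abstract does not state which contrastive objective is used. As a rough illustration of the general idea only, the sketch below computes a video-to-text InfoNCE-style loss over paired embeddings. The function names, the cosine similarity, the temperature value, and the toy two-dimensional embeddings are all assumptions made for this example; none of them come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(video_embs, text_embs, temperature=0.1):
    """Mean video-to-text InfoNCE loss (hypothetical helper, not the paper's loss).

    Pair i is (video_embs[i], text_embs[i]); every other text in the batch
    acts as a negative for video i.
    """
    total = 0.0
    for i, video in enumerate(video_embs):
        logits = [cosine(video, text) / temperature for text in text_embs]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(video_embs)


# Toy embeddings: in the first call each video matches its own text,
# in the second call the texts are swapped so the pairs are misaligned.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Under these assumptions the loss is small when each video embedding is closest to its own query embedding and large when the pairs are misaligned, which is the behavior a contrastive training signal for semantic alignment relies on.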
Pages: 8