End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

Cited by: 7
Authors
Shen, Wenxue [1 ]
Song, Jingkuan [2 ,3 ]
Zhu, Xiaosu [1 ]
Li, Gongfu [4 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Shenzhen Inst Adv Study, Shenzhen 518110, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Tencent Inc, Corp Dev Grp, Shenzhen 518057, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Videos; Hidden Markov models; Semantics; Task analysis; Transformers; Training; Feature extraction; Multimodal pre-training; video retrieval; contrastive learning
DOI
10.1109/TIP.2023.3275071
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically underexploit the hierarchical semantic information in videos, such as frame-level semantic information and global video-level semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity to it than frames from other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, the momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the learned representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
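For readers unfamiliar with the momentum-contrast idea mentioned in the abstract, the sketch below shows how a MoCo-style video-text contrastive objective with queued negatives could look in PyTorch. It is a minimal illustration assembled only from the abstract's description: the class name `MultimodalMomentumContrast`, the encoder interfaces, and all hyperparameters (embedding dimension, queue size, momentum, temperature) are hypothetical placeholders, not the authors' implementation; see the repository linked above for the actual code.

```python
# Minimal sketch of MoCo-style multimodal momentum contrast for video-text
# matching. All names and hyperparameters are illustrative assumptions.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalMomentumContrast(nn.Module):
    def __init__(self, video_encoder, text_encoder, dim=256, queue_size=4096,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.m = momentum
        self.t = temperature
        self.queue_size = queue_size
        # Online encoders (updated by gradients) and momentum encoders (EMA copies).
        self.video_enc = video_encoder
        self.text_enc = text_encoder
        self.video_enc_m = copy.deepcopy(video_encoder)
        self.text_enc_m = copy.deepcopy(text_encoder)
        for enc in (self.video_enc_m, self.text_enc_m):
            for p in enc.parameters():
                p.requires_grad = False
        # Queues of past momentum embeddings serve as extra negative samples.
        self.register_buffer("video_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("text_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential moving average update of the momentum encoders.
        for online, key in [(self.video_enc, self.video_enc_m),
                            (self.text_enc, self.text_enc_m)]:
            for p_o, p_k in zip(online.parameters(), key.parameters()):
                p_k.data.mul_(self.m).add_(p_o.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, v_keys, t_keys):
        # Replace the oldest entries in the queues with the newest momentum keys.
        bsz = v_keys.size(0)
        ptr = int(self.queue_ptr)
        assert self.queue_size % bsz == 0  # simplification for this sketch
        self.video_queue[:, ptr:ptr + bsz] = v_keys.t()
        self.text_queue[:, ptr:ptr + bsz] = t_keys.t()
        self.queue_ptr[0] = (ptr + bsz) % self.queue_size

    def forward(self, video, text):
        # Assumed encoder interface: each encoder maps a batch to (B, dim) features.
        v = F.normalize(self.video_enc(video), dim=-1)
        t = F.normalize(self.text_enc(text), dim=-1)
        with torch.no_grad():
            self._momentum_update()
            v_m = F.normalize(self.video_enc_m(video), dim=-1)
            t_m = F.normalize(self.text_enc_m(text), dim=-1)
        # Video-to-text: positives are the paired in-batch momentum text keys,
        # extra negatives come from the text queue (and symmetrically for text-to-video).
        labels = torch.arange(v.size(0), device=v.device)
        logits_v2t = torch.cat([v @ t_m.t(), v @ self.text_queue], dim=1) / self.t
        logits_t2v = torch.cat([t @ v_m.t(), t @ self.video_queue], dim=1) / self.t
        loss = (F.cross_entropy(logits_v2t, labels) +
                F.cross_entropy(logits_t2v, labels)) / 2
        self._enqueue(v_m, t_m)
        return loss
```

The queue lets each step contrast against far more negatives than the current mini-batch contains, which is the property the abstract credits for better-generalizing representations.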
Pages: 5017-5030
Number of pages: 14