End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval

Cited by: 7
Authors
Shen, Wenxue [1 ]
Song, Jingkuan [2 ,3 ]
Zhu, Xiaosu [1 ]
Li, Gongfu [4 ]
Shen, Heng Tao [1 ,3 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Shenzhen Inst Adv Study, Shenzhen 518110, Peoples R China
[3] Peng Cheng Lab, Shenzhen 518066, Peoples R China
[4] Tencent Inc, Corp Dev Grp, Shenzhen 518057, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Videos; Hidden Markov models; Semantics; Task analysis; Transformers; Training; Feature extraction; Multimodal pre-training; video retrieval; contrastive learning
DOI
10.1109/TIP.2023.3275071
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically underexploit the hierarchical semantic information in videos, such as frame-level semantic information and global video-level semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity to it than frames from other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, the momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of the learned representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
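For readers unfamiliar with the momentum-contrast idea mentioned in the abstract, the sketch below shows how a MoCo-style video-text contrastive objective with queued negatives could look in PyTorch. It is a minimal illustration assembled only from the abstract's description: the class name `MultimodalMomentumContrast`, the encoder interfaces, and all hyperparameters (embedding dimension, queue size, momentum, temperature) are hypothetical placeholders, not the authors' implementation; see the repository linked above for the actual code.

```python
# Minimal sketch of MoCo-style multimodal momentum contrast for video-text
# matching. All names and hyperparameters are illustrative assumptions.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalMomentumContrast(nn.Module):
    def __init__(self, video_encoder, text_encoder, dim=256, queue_size=4096,
                 momentum=0.999, temperature=0.07):
        super().__init__()
        self.m = momentum
        self.t = temperature
        self.queue_size = queue_size
        # Online encoders (updated by gradients) and momentum encoders (EMA copies).
        self.video_enc = video_encoder
        self.text_enc = text_encoder
        self.video_enc_m = copy.deepcopy(video_encoder)
        self.text_enc_m = copy.deepcopy(text_encoder)
        for enc in (self.video_enc_m, self.text_enc_m):
            for p in enc.parameters():
                p.requires_grad = False
        # Queues of past momentum embeddings serve as extra negative samples.
        self.register_buffer("video_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("text_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential moving average update of the momentum encoders.
        for online, key in [(self.video_enc, self.video_enc_m),
                            (self.text_enc, self.text_enc_m)]:
            for p_o, p_k in zip(online.parameters(), key.parameters()):
                p_k.data.mul_(self.m).add_(p_o.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, v_keys, t_keys):
        # Replace the oldest entries in the queues with the newest momentum keys.
        bsz = v_keys.size(0)
        ptr = int(self.queue_ptr)
        assert self.queue_size % bsz == 0  # simplification for this sketch
        self.video_queue[:, ptr:ptr + bsz] = v_keys.t()
        self.text_queue[:, ptr:ptr + bsz] = t_keys.t()
        self.queue_ptr[0] = (ptr + bsz) % self.queue_size

    def forward(self, video, text):
        # Assumed encoder interface: each encoder maps a batch to (B, dim) features.
        v = F.normalize(self.video_enc(video), dim=-1)
        t = F.normalize(self.text_enc(text), dim=-1)
        with torch.no_grad():
            self._momentum_update()
            v_m = F.normalize(self.video_enc_m(video), dim=-1)
            t_m = F.normalize(self.text_enc_m(text), dim=-1)
        # Video-to-text: positives are the paired in-batch momentum text keys,
        # extra negatives come from the text queue (and symmetrically for text-to-video).
        labels = torch.arange(v.size(0), device=v.device)
        logits_v2t = torch.cat([v @ t_m.t(), v @ self.text_queue], dim=1) / self.t
        logits_t2v = torch.cat([t @ v_m.t(), t @ self.video_queue], dim=1) / self.t
        loss = (F.cross_entropy(logits_v2t, labels) +
                F.cross_entropy(logits_t2v, labels)) / 2
        self._enqueue(v_m, t_m)
        return loss
```

The queue lets each step contrast against far more negatives than the current mini-batch contains, which is the property the abstract credits for better-generalizing representations.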
Pages: 5017-5030
Number of pages: 14