CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning

Cited by: 369
Authors
Luo, Huaishao [1 ]
Ji, Lei [2 ,3 ,4 ]
Zhong, Ming [5 ]
Chen, Yang [5 ]
Lei, Wen [5 ]
Duan, Nan [4 ]
Li, Tianrui [1 ]
Affiliations
[1] Southwest Jiaotong University, Chengdu, China
[2] Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
[3] University of Chinese Academy of Sciences, Beijing, China
[4] Microsoft Research Asia, Beijing, China
[5] Microsoft STCA, Beijing, China
Funding
U.S. National Science Foundation
Keywords
Video retrieval; Video captioning; CLIP
DOI
10.1016/j.neucom.2022.07.028
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the image-text pre-trained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies on: 1) whether image features are sufficient for video-text retrieval and captioning; 2) how post-pretraining on a large-scale video-text dataset affects CLIP's performance; 3) what the practical mechanism is for modeling temporal dependency between video frames; and 4) the hyper-parameter sensitivity of the model. Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for both multimodal understanding and generation tasks. © 2022 Elsevier B.V. All rights reserved.
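The third question above concerns how to aggregate per-frame CLIP features into a video-level representation for retrieval. As a minimal sketch of the simplest such mechanism, the parameter-free mean-pooling similarity, the following snippet illustrates the idea; the function name, tensor shapes, and the random tensors standing in for real CLIP encoder outputs are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def video_text_similarity(frame_feats: torch.Tensor,
                          text_feats: torch.Tensor) -> torch.Tensor:
    """Parameter-free similarity: mean-pool per-frame features into one
    video embedding, then take cosine similarity with each text embedding.

    frame_feats: (num_videos, num_frames, dim) image-encoder outputs
    text_feats:  (num_texts, dim) text-encoder outputs
    returns:     (num_videos, num_texts) similarity matrix
    """
    video_feats = frame_feats.mean(dim=1)           # temporal mean pooling
    video_feats = F.normalize(video_feats, dim=-1)  # unit-norm so dot product = cosine
    text_feats = F.normalize(text_feats, dim=-1)
    return video_feats @ text_feats.T

# Toy usage with random features standing in for CLIP outputs
# (e.g. 512-d ViT-B/32 embeddings over 12 sampled frames).
frames = torch.randn(4, 12, 512)  # 4 videos
texts = torch.randn(4, 512)       # 4 captions
print(video_text_similarity(frames, texts).shape)  # torch.Size([4, 4])
```

Because mean pooling introduces no new parameters, it ignores frame order; the paper also studies learned alternatives that model temporal dependency explicitly.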
Pages: 293-304
Page count: 12