Learning Text-to-Video Retrieval from Image Captioning

Cited: 0
Authors
Ventura, Lucas [1 ,2 ]
Schmid, Cordelia [2 ]
Varol, Gul [1 ]
Affiliations
[1] Univ Gustave Eiffel, Ecole Ponts, LIGM, CNRS, Marne La Vallee, France
[2] PSL Res Univ, Inria, CNRS, ENS, Paris, France
Keywords
Text-to-video retrieval; Image captioning; Multimodal learning;
DOI
10.1007/s11263-024-02202-8
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the form of text. Using image expert models is a realistic scenario, given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning allows text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform a temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
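The caption-conditioned temporal pooling mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's released implementation: the function name, the softmax temperature, and the toy data are assumptions for the sketch; the paper's general idea is to weight each frame embedding by its similarity to the caption embedding before pooling.

```python
import numpy as np

def caption_scored_pooling(frame_feats, caption_feat, temperature=0.07):
    """Pool per-frame embeddings into one video embedding, weighting each
    frame by its relevance (cosine similarity) to the caption embedding."""
    # L2-normalize so dot products are cosine similarities.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = caption_feat / np.linalg.norm(caption_feat)
    sims = f @ c                       # (T,) frame-caption similarities
    w = np.exp(sims / temperature)
    w = w / w.sum()                    # relevance weights, sum to 1
    return w @ frame_feats             # (D,) weighted temporal pooling

# Toy example: 8 frames with 4-dim features and one caption embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))
caption = rng.normal(size=4)
video_emb = caption_scored_pooling(frames, caption)
```

With a low temperature the softmax concentrates the pooled representation on the frames most relevant to the caption, rather than averaging all frames uniformly as a plain mean-pooling baseline would.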
Pages: 1834-1854
Page count: 21
Related Papers
50 records in total
  • [1] Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
    Dong, Jianfeng; Wang, Yabing; Chen, Xianke; Qu, Xiaoye; Li, Xirong; He, Yuan; Wang, Xun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08): 5680-5694
  • [2] An Empirical Study of Frame Selection for Text-to-Video Retrieval
    Wu, Mengxia; Cao, Min; Bai, Yang; Zeng, Ziyin; Chen, Chen; Nie, Liqiang; Zhang, Min
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023: 6821-6832
  • [3] Holistic Features are almost Sufficient for Text-to-Video Retrieval
    Tian, Kaibin; Zhao, Ruixiang; Xin, Zijie; Lan, Bangxiang; Li, Xirong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 17138-17147
  • [4] Factorizing Text-to-Video Generation by Explicit Image Conditioning
    Girdhar, Rohit; Singh, Mannat; Brown, Andrew; Duval, Quentin; Azadi, Samaneh; Rambhatla, Sai Saketh; Shah, Akbar; Yin, Xi; Parikh, Devi; Misra, Ishan
    COMPUTER VISION - ECCV 2024, PT LXII, 2025, 15120: 205-224
  • [5] Summarization of Text and Image Captioning in Information Retrieval Using Deep Learning Techniques
    Mahalakshmi, P.; Fatima, N. Sabiyath
    IEEE ACCESS, 2022, 10: 18289-18297
  • [6] Visual to Text: Survey of Image and Video Captioning
    Li, Sheng; Tao, Zhiqiang; Li, Kang; Fu, Yun
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2019, 3 (04): 297-312
  • [7] Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
    Tian, Kaibin; Cheng, Yanhua; Liu, Yi; Hou, Xinglin; Chen, Quan; Li, Han
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024: 5207-5214
  • [8] Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
    Hu, Fan; Chen, Aozhu; Wang, Ziyue; Zhou, Fangming; Dong, Jianfeng; Li, Xirong
    COMPUTER VISION - ECCV 2022, PT XIV, 2022, 13674: 444-461
  • [9] Write What You Want: Applying Text-to-Video Retrieval to Audiovisual Archives
    Yang, Yuchen
    ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE, 2023, 16 (04)
  • [10] Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
    Rodriguez, Pedro; Azab, Mahmoud; Silvert, Becka; Sanchez, Renato; Labson, Linzy; Shah, Hardik; Moon, Seungwhan
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023: 47-68