Learning Text-to-Video Retrieval from Image Captioning

Cited by: 0

Authors:

Ventura, Lucas [1,2]

Schmid, Cordelia [2]

Varol, Gul [1]

Affiliations:

[1] Univ Gustave Eiffel, Ecole Ponts, LIGM, CNRS, Marne La Vallee, France

[2] PSL Res Univ, Inria, CNRS, ENS, Paris, France

Source:

International Journal of Computer Vision | 2024

Keywords:

Text-to-video retrieval; Image captioning; Multimodal learning

DOI:

10.1007/s11263-024-02202-8

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to images labeled with text. Relying on image expert models is a realistic scenario, given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning makes text-to-video retrieval training possible. This process adapts the features to the target domain at no manual annotation cost, and consequently outperforms the strong zero-shot CLIP baseline. During training, we sample, from multiple video frames, the captions that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
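The two training-time steps named in the abstract (keeping the captions that best match the visual content, and pooling frames by their relevance to a caption) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the use of cosine similarity, the softmax weighting, and the `temperature` value are all assumptions, and the embeddings are taken to be precomputed arrays (for example from a CLIP-style dual encoder).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_captions(frame_emb, caption_emb, k):
    """Hypothetical caption filter.

    frame_emb:   (T, D) embedding of each sampled frame
    caption_emb: (T, D) embedding of the caption generated for that frame
    Returns the indices of the k frames whose own caption matches them best
    (cosine similarity is an assumption, not taken from the abstract).
    """
    f = l2_normalize(frame_emb)
    c = l2_normalize(caption_emb)
    scores = np.sum(f * c, axis=1)  # one similarity per (frame, caption) pair
    return np.argsort(-scores)[:k]


def caption_scored_pooling(frame_emb, text_emb, temperature=0.1):
    """Hypothetical caption-conditioned temporal pooling.

    frame_emb: (T, D) frame embeddings, text_emb: (D,) caption embedding.
    Frames are weighted by a softmax over their similarity to the caption
    (the softmax and the temperature are assumptions).
    Returns the pooled (D,) video embedding and the (T,) frame weights.
    """
    f = l2_normalize(frame_emb)
    t = l2_normalize(text_emb)
    logits = (f @ t) / temperature           # relevance of each frame
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ f, weights
```

With a lower `temperature` the pooled embedding is dominated by the single most relevant frame; with a higher one it approaches plain mean pooling over frames.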


Pages: 1834-1854

Page count: 21

