Holistic Features are almost Sufficient for Text-to-Video Retrieval

Cited by: 4
Authors
Tian, Kaibin [1]
Zhao, Ruixiang [1]
Xin, Zijie [1,2]
Lan, Bangxiang [1]
Li, Xirong [1]
Affiliations
[1] Renmin Univ China, Key Lab DEKE, MoE, Beijing, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01622
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, which enables a CLIP4Clip-based student network to learn from more advanced yet computationally intensive models. To create a learning channel that conveys fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.
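The idea of supervising attention weights with teacher scores can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `scorer` vector, the temperature, the cross-entropy form of the loss, and all function names are assumptions chosen for illustration, and the toy numbers are made up.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def afa_weights(frame_feats, scorer):
    # Hypothetical attention: one scalar score per frame from a dot product
    # with a learned vector, normalized into weights that sum to 1.
    scores = [sum(f_d * s_d for f_d, s_d in zip(f, scorer)) for f in frame_feats]
    return softmax(scores)

def aggregate(frame_feats, weights):
    # Weighted sum of frame features gives a single video-level feature,
    # so only one vector per video needs to be stored for retrieval.
    dim = len(frame_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_feats)) for d in range(dim)]

def soft_label_loss(student_weights, teacher_scores, temperature=1.0):
    # Assumed loss: cross-entropy between the teacher's frame-text relevance
    # distribution (soft labels) and the student's attention weights.
    target = softmax(teacher_scores, temperature)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(target, student_weights))

# Toy data: three frames with 2-dimensional features (illustrative values only).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scorer = [0.5, -0.5]
teacher_scores = [0.9, 0.1, 0.5]

weights = afa_weights(frames, scorer)
video_feat = aggregate(frames, weights)
loss = soft_label_loss(weights, teacher_scores)
```

In this sketch the teacher scores are only needed to compute the training loss; at retrieval time the student uses `video_feat` alone, which is consistent with the abstract's claim that the AFA block adds no retrieval-stage overhead.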
Pages: 17138-17147
Number of pages: 10