Holistic Features are almost Sufficient for Text-to-Video Retrieval

Cited by: 2
Authors
Tian, Kaibin [1]
Zhao, Ruixiang [1]
Xin, Zijie [1, 2]
Lan, Bangxiang [1]
Li, Xirong [1]
Affiliations
[1] Renmin Univ China, Key Lab DEKE, MoE, Beijing, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01622
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, enabling a CLIP4Clip based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.
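The abstract's core idea, attention weights over frames supervised by a teacher's frame-text relevance scores, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how AFA computes its weights or which loss is used, so the softmax over per-frame logits, the cross-entropy loss, the temperature parameter, and all function names below are assumptions made for illustration only.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_frames(frame_features, frame_logits):
    """Collapse per-frame features into one video-level vector.

    frame_logits stands in for the output of a learned attention layer
    (hypothetical here); the weights are their softmax.
    """
    weights = softmax(frame_logits)
    dim = len(frame_features[0])
    video = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return video, weights


def soft_label_loss(student_weights, teacher_scores, temperature=1.0):
    """Cross-entropy between teacher soft labels and student weights.

    The choice of cross-entropy is an assumption, not taken from the abstract.
    """
    targets = softmax(teacher_scores, temperature)
    eps = 1e-12  # guards against log(0)
    return -sum(t * math.log(w + eps) for t, w in zip(targets, student_weights))


# Three frames with 2-dimensional features (toy values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_scores = [2.0, 0.0, -1.0]  # hypothetical teacher frame-text relevance

video_vec, weights = aggregate_frames(features, [0.5, 0.1, -0.3])
loss = soft_label_loss(weights, teacher_scores)
```

In this sketch only the aggregated `video_vec` would need to be stored per video, which is consistent with the abstract's statement that the block adds no storage or computation overhead at the retrieval stage.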
Pages: 17138-17147
Page count: 10