Holistic Features are almost Sufficient for Text-to-Video Retrieval

Cited by: 2
Authors
Tian, Kaibin [1]
Zhao, Ruixiang [1]
Xin, Zijie [1, 2]
Lan, Bangxiang [1]
Li, Xirong [1]
Affiliations
[1] Renmin Univ China, Key Lab DEKE, MoE, Beijing, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China
Source
2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024
DOI
10.1109/CVPR52733.2024.01622
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, enabling a CLIP4Clip based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.
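The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear scoring vector `w`, the cross-entropy form of the soft-label loss, the `temperature` parameter, and all shapes and function names are assumptions made for illustration only. The sketch shows attention weights over frame features producing a single video vector, and teacher frame-text scores serving as soft targets for those weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def afa_aggregate(frame_feats, w):
    """Aggregate frame features into one video feature with attention weights.

    frame_feats: (n_frames, dim) array of per-frame features.
    w: (dim,) scoring vector (hypothetical; the real block may differ).
    """
    logits = frame_feats @ w            # one scalar score per frame
    weights = softmax(logits)           # attention weights, sum to 1
    video_feat = weights @ frame_feats  # weighted sum, shape (dim,)
    return video_feat, weights


def soft_label_loss(student_weights, teacher_scores, temperature=1.0):
    """Cross-entropy between teacher soft labels and student attention weights.

    teacher_scores: (n_frames,) frame-text relevance scores from the teacher.
    The cross-entropy form and the temperature are assumptions.
    """
    target = softmax(teacher_scores / temperature)
    return float(-np.sum(target * np.log(student_weights + 1e-12)))


# Toy data standing in for real frame features and teacher scores.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(12, 512))
w = rng.normal(size=512) * 0.01
teacher_scores = rng.normal(size=12)

video_feat, weights = afa_aggregate(frame_feats, w)
loss = soft_label_loss(weights, teacher_scores)
```

Because the aggregated output is still one vector per video, retrieval can remain a single similarity computation between a text feature and a stored video feature, which is consistent with the abstract's statement that the block adds no extra storage or computation overhead at the retrieval stage.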
Pages: 17138-17147
Page count: 10
Related papers
50 in total
  • [1] Learning Text-to-Video Retrieval from Image Captioning
    Ventura, Lucas
    Schmid, Cordelia
    Varol, Gul
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, : 1834 - 1854
  • [2] An Empirical Study of Frame Selection for Text-to-Video Retrieval
    Wu, Mengxia
    Cao, Min
    Bai, Yang
    Zeng, Ziyin
    Chen, Chen
    Nie, Liqiang
    Zhang, Min
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 6821 - 6832
  • [3] Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval
    Hu, Fan
    Chen, Aozhu
    Wang, Ziyue
    Zhou, Fangming
    Dong, Jianfeng
    Li, Xirong
    COMPUTER VISION - ECCV 2022, PT XIV, 2022, 13674 : 444 - 461
  • [4] Write What You Want: Applying Text-to-Video Retrieval to Audiovisual Archives
    Yang, Yuchen
    ACM JOURNAL ON COMPUTING AND CULTURAL HERITAGE, 2023, 16 (04):
  • [5] Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
    Rodriguez, Pedro
    Azab, Mahmoud
    Silvert, Becka
    Sanchez, Renato
    Labson, Linzy
    Shah, Hardik
    Moon, Seungwhan
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 47 - 68
  • [6] Relation Triplet Construction for Cross-modal Text-to-Video Retrieval
    Song, Xue
    Chen, Jingjing
    Jiang, Yu-Gang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4759 - 4767
  • [7] Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
    Ibrahimi, Sarah
    Sun, Xiaohang
    Wang, Pichao
    Garg, Amanmeet
    Sanan, Ashutosh
    Omar, Mohamed
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 12020 - 12030
  • [8] Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval
    Yin, Shukang
    Zhao, Sirui
    Wang, Hao
    Xu, Tong
    Chen, Enhong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (10)
  • [9] Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
    Dong, Jianfeng
    Wang, Yabing
    Chen, Xianke
    Qu, Xiaoye
    Li, Xirong
    He, Yuan
    Wang, Xun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (08) : 5680 - 5694
  • [10] A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval
    Lokoc, Jakub
    Soucek, Tomas
    Vesely, Patrik
    Mejzlik, Frantisek
    Ji, Jiaqi
    Xu, Chaoxi
    Li, Xirong
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2553 - 2561