Holistic Features are almost Sufficient for Text-to-Video Retrieval

Cited by: 2

Authors:

Tian, Kaibin [1]

Zhao, Ruixiang [1]

Xin, Zijie [1,2]

Lan, Bangxiang [1]

Li, Xirong [1]

Affiliations:

[1] Renmin Univ China, Key Lab DEKE, MoE, Beijing, Peoples R China

[2] Sichuan Univ, Coll Comp Sci, Chengdu, Peoples R China

Source:

2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2024

Keywords:

DOI:

10.1109/CVPR52733.2024.01622

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, which puts their scalability for large-scale T2VR applications in doubt. We propose TeachCLIP, which enables a CLIP4Clip-based student network to learn from more advanced yet computationally intensive models. To create a learning channel that conveys fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets support the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet achieves near-SOTA effectiveness.
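The mechanism summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how AFA computes its weights or which loss is used, so the single scoring vector `w_attn`, the softmax normalization, the cross-entropy loss, the `temperature` parameter, and all function names are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def aggregate_frames(frame_feats, w_attn):
    """Attention-weighted pooling of frame features into one video feature.

    frame_feats: (n_frames, dim) per-frame features.
    w_attn:      (dim,) hypothetical scoring vector (an assumption; the
                 abstract does not describe the internals of AFA).
    """
    logits = frame_feats @ w_attn        # one scalar score per frame
    weights = softmax(logits)            # attentive weights, sum to 1
    video_feat = weights @ frame_feats   # (dim,) holistic video feature
    return video_feat, weights


def soft_label_loss(student_weights, teacher_scores, temperature=1.0):
    """Cross-entropy between teacher soft labels and student weights.

    teacher_scores: (n_frames,) frame-text relevance from a teacher model.
    The choice of cross-entropy and temperature is an assumption.
    """
    soft_labels = softmax(teacher_scores / temperature)
    return float(-np.sum(soft_labels * np.log(student_weights + 1e-12)))


# Toy data standing in for real frame features and teacher scores.
rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))
w = rng.normal(size=512) * 0.01
teacher_scores = rng.normal(size=12)

video_feat, weights = aggregate_frames(frames, w)
loss = soft_label_loss(weights, teacher_scores)
```

In this sketch the weights depend only on the frames, so the pooled video feature can be computed once offline and compared with a text feature by a single dot product at query time, which is consistent with the abstract's claim of no extra storage or computation overhead at the retrieval stage.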


Pages: 17138-17147

Page count: 10
