TEACHTEXT: CrossModal text-video retrieval through generalized distillation

Cited by: 0

Authors:

Croitoru, Ioana [1, 2]

Bogolin, Simion-Vlad [1, 2]

Leordeanu, Marius [3]

Jin, Hailin [4]

Zisserman, Andrew [1]

Liu, Yang [1, 5]

Albanie, Samuel [6]

Affiliations:

[1] Univ Oxford, Visual Geometry Grp, Oxford, England

[2] Romanian Acad, Inst Math, Bucharest, Romania

[3] Univ Politehn Bucuresti, Bucharest, Romania

[4] Adobe Res, San Jose, CA USA

[5] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China

[6] Univ Cambridge, Dept Engn, Cambridge, England

Source:

ARTIFICIAL INTELLIGENCE | 2025 / Vol. 338

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

Text-video retrieval; Distillation; Text embeddings; Video experts;

DOI:

10.1016/j.artint.2024.104235

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405;

Abstract:

In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we investigate the design of such algorithms and propose a novel generalized distillation method, TEACHTEXT, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. TEACHTEXT yields significant gains on a number of video retrieval benchmarks without incurring additional computational overhead during inference and was used to produce the winning entry in the Condensed Movie Challenge at ICCV 2021. We show how TEACHTEXT can be extended to include multiple video modalities, reducing computational cost at inference without compromising performance. Finally, we demonstrate the application of our method to the task of removing noisy descriptions from the training partitions of retrieval datasets to improve performance. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.

Pages: 20

