TEACHTEXT: CrossModal text-video retrieval through generalized distillation

Authors
Croitoru, Ioana [1 ,2 ]
Bogolin, Simion-Vlad [1 ,2 ]
Leordeanu, Marius [3 ]
Jin, Hailin [4 ]
Zisserman, Andrew [1 ]
Liu, Yang [1 ,5 ]
Albanie, Samuel [6 ]
Affiliations
[1] Univ Oxford, Visual Geometry Grp, Oxford, England
[2] Romanian Acad, Inst Math, Bucharest, Romania
[3] Univ Politehn Bucuresti, Bucharest, Romania
[4] Adobe Res, San Jose, CA USA
[5] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[6] Univ Cambridge, Dept Engn, Cambridge, England
Funding
Engineering and Physical Sciences Research Council (UK);
Keywords
Text-video retrieval; Distillation; Text embeddings; Video experts;
DOI
10.1016/j.artint.2024.104235
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we investigate the design of such algorithms and propose a novel generalized distillation method, TEACHTEXT, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. TEACHTEXT yields significant gains on a number of video retrieval benchmarks without incurring additional computational overhead during inference and was used to produce the winning entry in the Condensed Movie Challenge at ICCV 2021. We show how TEACHTEXT can be extended to include multiple video modalities, reducing computational cost at inference without compromising performance. Finally, we demonstrate the application of our method to the task of removing noisy descriptions from the training partitions of retrieval datasets to improve performance. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
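As a rough illustration of the idea in the abstract (several text encoders acting as teachers that supply an extra supervisory signal to a single retrieval model), the sketch below distills text-video similarity matrices. It is not taken from the authors' code: the function names, the use of a plain mean to combine teachers, the squared-error loss, and the random embeddings are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, video_emb):
    """Return the (num_texts, num_videos) matrix of cosine similarities."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return t @ v.T


def aggregate_teachers(teacher_sims):
    """Combine teacher similarity matrices (assumption: simple mean)."""
    return np.mean(np.stack(teacher_sims, axis=0), axis=0)


def distillation_loss(student_sim, teacher_sim):
    """Assumed loss: mean squared error between the two similarity matrices."""
    return float(np.mean((student_sim - teacher_sim) ** 2))


# Hypothetical data: 4 captions, 4 videos, 8-dimensional joint embeddings.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(4, 8))

# Each "teacher" stands in for a model built on a different text encoder.
teacher_sims = [
    cosine_similarity_matrix(rng.normal(size=(4, 8)), video_emb)
    for _ in range(3)
]
student_sim = cosine_similarity_matrix(rng.normal(size=(4, 8)), video_emb)

target = aggregate_teachers(teacher_sims)
loss = distillation_loss(student_sim, target)
```

In a setup like this the teachers are needed only while training, so the student alone would be evaluated at query time, which is in line with the abstract's statement that no additional computational overhead is incurred during inference.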
Pages: 20