Text-guided distillation learning to diversify video embeddings for text-video retrieval

Citations: 0

Authors:

Lee, Sangmin ^{[1]}

Kim, Hyung-Il ^{[2]}

Ro, Yong Man ^{[3]}

Affiliations:

[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA

[2] Elect & Telecommun Res Inst, Visual Intelligence Res Sect, Daejeon 34129, South Korea

[3] Korea Adv Inst Sci & Technol, Image & Video Syst Lab, Daejeon 34141, South Korea

Source:

PATTERN RECOGNITION | 2024 / Vol. 156

Keywords:

Text-video retrieval; Diverse video embedding; Text-guided distillation learning; Text-agnostic; One-to-many correspondence

DOI:

10.1016/j.patcog.2024.110754

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to address temporal and semantic diversity, thereby constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire a meaningful, distinct competency in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be reused for any new text query. Experiments show that our method outperforms existing methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses of the diverse video embeddings.
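The one-to-many idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (taking the best cosine similarity across a video's embeddings), the function names `cosine`, `video_score`, and `rank_videos`, and the toy 2-D vectors are all assumptions made for illustration. The sketch only shows how several precomputed, text-agnostic embeddings per video could let one video match different text queries.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def video_score(text_emb, video_embs):
    """Assumed rule: a video's score is its best-matching embedding path."""
    return max(cosine(text_emb, v) for v in video_embs)

def rank_videos(text_emb, video_index):
    """Return video ids sorted from best to worst match for one text query."""
    return sorted(
        video_index,
        key=lambda vid: video_score(text_emb, video_index[vid]),
        reverse=True,
    )

# Hypothetical index: embeddings are computed once, without seeing any text.
video_index = {
    "video_a": [[1.0, 0.0], [0.0, 1.0]],  # two paths covering different semantics
    "video_b": [[0.7, 0.7]],              # a single averaged embedding
}

query_1 = [1.0, 0.1]  # toy embedding of one description
query_2 = [0.1, 1.0]  # toy embedding of a very different description

print(rank_videos(query_1, video_index))
print(rank_videos(query_2, video_index))
```

In this toy setup, `video_a` ranks first for both queries because each query aligns closely with one of its two embeddings, whereas the single embedding of `video_b` is only a moderate match for either.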


Pages: 10
