Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Cited by: 0
Authors
Tian, Kaibin [1]
Cheng, Yanhua [1]
Liu, Yi [1]
Hou, Xinglin [1]
Chen, Quan [1]
Li, Han [1]
Affiliations
[1] Kuaishou Technology, Beijing, People's Republic of China
Keywords
CLIP;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, the methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computational complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture in the retrieval phase. This solution balances the coarse and fine granularity of retrieval content, and it also balances retrieval effectiveness against efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, our method achieves comparable performance with the current state-of-the-art methods while being nearly 50 times faster.
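The two-stage retrieval described in the abstract (coarse recall of top-k candidates, then fine-grained reranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary-based indexes, the `temperature` parameter, and in particular the fine-grained score (a softmax over text-frame similarities, used here only as a stand-in for a parameter-free text-gated aggregation) are all assumptions, since the abstract does not specify the form of the TIB.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def coarse_score(text, video_vec):
    """Cosine similarity between the text and one video-level vector."""
    return dot(normalize(text), normalize(video_vec))


def fine_score(text, frames, temperature=0.1):
    """Hypothetical fine-grained score: frame similarities weighted by a
    softmax over those same similarities, so frames matching the text
    dominate. No learned parameters are involved (an assumption, not the
    paper's TIB)."""
    t = normalize(text)
    sims = [dot(t, normalize(f)) for f in frames]
    m = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in sims]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def two_stage_retrieve(text, coarse_index, fine_index, k):
    """Stage 1: recall the top-k video ids by coarse score.
    Stage 2: rerank only those k candidates by fine score."""
    recalled = sorted(
        coarse_index,
        key=lambda vid: coarse_score(text, coarse_index[vid]),
        reverse=True,
    )[:k]
    return sorted(
        recalled,
        key=lambda vid: fine_score(text, fine_index[vid]),
        reverse=True,
    )
```

In this sketch the more expensive frame-level scoring runs on only k candidates rather than on the whole collection, which is the general source of the efficiency gain that a recall-then-rerank design aims for.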
Pages: 5207-5214
Number of pages: 8
Related papers
50 in total
  • [41] Leveraging Coarse-to-Fine Grained Representations in Contrastive Learning for Differential Medical Visual Question Answering
    Liang, Xiao
    Wang, Yin
    Wang, Di
    Jiao, Zhicheng
    Zhong, Haodi
    Yang, Mengyu
    Wang, Quan
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT V, 2024, 15005 : 415 - 425
  • [42] Concession-First Learning and Coarse-to-Fine Retrieval for Open-Domain Conversational Question Answering
    Li, Xibo
    Zou, Bowei
    Dong, Mengxing
    Yao, Jianmin
    Hong, Yu
    2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 317 - 324
  • [43] Effective and Efficient Sports Play Retrieval with Deep Representation Learning
    Wang, Zheng
    Long, Cheng
    Cong, Gao
    Ju, Ce
    KDD'19: PROCEEDINGS OF THE 25TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2019, : 499 - 509
  • [44] Efficient text-to-video retrieval via multi-modal multi-tagger derived pre-screening
    Yingjia Xu
    Mengxia Wu
    Zixin Guo
    Min Cao
    Mang Ye
    Jorma Laaksonen
    Visual Intelligence, 2025, 3 (1):
  • [45] Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval
    Shi, Yaya
    Liu, Haowei
    Xu, Haiyang
    Ma, Zongyang
    Ye, Qinghao
    Hu, Anwen
    Yan, Ming
    Zhang, Ji
    Huang, Fei
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4460 - 4470
  • [46] Learning Low-Rank and Sparse Discriminative Correlation Filters for Coarse-to-Fine Visual Object Tracking
    Xu, Tianyang
    Feng, Zhen-Hua
    Wu, Xiao-Jun
    Kittler, Josef
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (10) : 3727 - 3739
  • [47] Text-guided visual representation learning for medical image retrieval systems
    Serieys, Guillaume
    Kurtz, Camille
    Fournier, Laure
    Cloppet, Florence
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 593 - 598
  • [48] Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
    Singh, Harman
    Zhang, Pengchuan
    Wang, Qifan
    Wang, Mengjiao
    Xiong, Wenhan
    Du, Jingfei
    Chen, Yu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 869 - 893
  • [49] Learning Adaptive Spatio-Temporal Inference Transformer for Coarse-to-Fine Animal Visual Tracking: Algorithm and Benchmark
    Xu, Tianyang
    Kang, Ze
    Zhu, Xuefeng
    Wu, Xiao-Jun
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (07) : 2698 - 2712
  • [50] A context constraint and sparse learning based on correlation filter for high-confidence coarse-to-fine visual tracking
    Su, Yinqiang
    Xu, Fang
    Wang, Zhongshi
    Sun, Mingchao
    Zhao, Hui
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 268