Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

Cited by: 0
Authors
Tian, Kaibin [1 ]
Cheng, Yanhua [1 ]
Liu, Yi [1 ]
Hou, Xinglin [1 ]
Chen, Quan [1 ]
Li, Han [1 ]
Affiliation
[1] Kuaishou Technology, Beijing, People's Republic of China
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6 | 2024
Keywords
CLIP;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In recent years, text-to-video retrieval methods based on CLIP have developed rapidly. The primary direction of evolution is to exploit a wider gamut of visual and textual cues to achieve alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (word)-video (frame) interaction, regardless of the prohibitive computational complexity. Nevertheless, such approaches are suboptimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring that the model comprehensively captures visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage architecture for the retrieval phase. This solution balances the coarse and fine granularity of the retrieved content and strikes an equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate the efficiency and effectiveness of our method. Notably, it achieves performance comparable to the current state-of-the-art methods while being nearly 50 times faster.
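To make the two-stage retrieval scheme described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: one coarse vector per video is used for fast top-k recall over the whole gallery, and only the recalled candidates are reranked with a parameter-free, text-gated pooling of per-frame features. The softmax-based gating rule, tensor shapes, and function names are illustrative assumptions; the paper's exact TIB formulation and the Pearson Constraint are not reproduced here.

```python
# Hypothetical sketch of coarse-to-fine text-to-video retrieval (not the authors' code).
import torch
import torch.nn.functional as F


def text_gated_pooling(text_emb, frame_embs):
    """Parameter-free text-gated pooling (assumed form): weight each frame by its
    similarity to the query text and aggregate into a fine-grained video vector.

    text_emb:   (D,)   L2-normalized text embedding
    frame_embs: (T, D) L2-normalized per-frame embeddings of one video
    """
    gates = F.softmax(frame_embs @ text_emb, dim=0)                           # (T,)
    return F.normalize((gates.unsqueeze(1) * frame_embs).sum(dim=0), dim=0)   # (D,)


def coarse_to_fine_retrieval(text_emb, coarse_video_embs, frame_embs_all, k=50):
    """Stage 1: fast recall of top-k candidates with one coarse vector per video.
    Stage 2: rerank only those k candidates with text-gated fine-grained features.

    text_emb:          (D,)      query text embedding
    coarse_video_embs: (N, D)    one pooled embedding per gallery video
    frame_embs_all:    (N, T, D) per-frame embeddings for every gallery video
    """
    # Stage 1: coarse recall, a single matrix-vector product over the gallery.
    coarse_scores = coarse_video_embs @ text_emb                              # (N,)
    _, topk_idx = coarse_scores.topk(k)

    # Stage 2: fine-grained reranking restricted to the k recalled videos.
    fine_scores = torch.stack([
        text_gated_pooling(text_emb, frame_embs_all[i]) @ text_emb
        for i in topk_idx
    ])                                                                        # (k,)
    order = fine_scores.argsort(descending=True)
    return topk_idx[order], fine_scores[order]
```

Because the expensive frame-level interaction runs only on the k recalled candidates rather than on all N gallery videos, the retrieval cost stays close to that of a plain coarse (single-vector) search, which is consistent with the abstract's claimed speedup.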
Pages: 5207-5214
Page count: 8