Deep learning for video-text retrieval: a review

Authors
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
Affiliations
[1] Dalian University of Technology
[2] National University of Defense Technology
Source
International Journal of Multimedia Information Retrieval | 2023, Vol. 12
Keywords
Deep learning; Video-text retrieval; Cross-modal representation; Feature matching; Metric learning;
Abstract
Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task consists of four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and the choice of objective function. Finally, the samples retrieved from the dataset are ranked by their matching similarity to the query. In recent years, deep learning techniques have brought significant progress; however, VTR remains a challenging task because of open problems such as how to learn an efficient spatial-temporal video feature and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, with the aim of providing some insights for researchers in the field of video-text retrieval.
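The final ranking step described in the abstract can be illustrated with a minimal sketch. It assumes the text query and the candidate videos have already been embedded into a shared space, and it uses cosine similarity as the matching score; the function names and the toy two-dimensional vectors are hypothetical and are not taken from the survey.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_emb, candidate_embs):
    """Return (candidate_index, score) pairs, most similar first."""
    scored = [(i, cosine(query_emb, c)) for i, c in enumerate(candidate_embs)]
    # Sort by descending similarity; ties keep their original order.
    scored.sort(key=lambda pair: -pair[1])
    return scored


# Toy embeddings (assumed): one text query and three candidate videos.
text_query = [1.0, 0.0]
video_embs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

ranking = rank_by_similarity(text_query, video_embs)
for index, score in ranking:
    print(f"video {index}: similarity {score:.3f}")
```

The same function covers the reverse direction (video-to-text retrieval) by passing a video embedding as the query and sentence embeddings as the candidates.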