Deep learning for video-text retrieval: a review

Cited by: 0

Authors:

Cunjuan Zhu

Qi Jia

Wei Chen

Yanming Guo

Yu Liu

Affiliations:

[1] Dalian University of Technology

[2] National University of Defense Technology

Source:

International Journal of Multimedia Information Retrieval | 2023 / Vol. 12

Keywords:

Deep learning; Video-text retrieval; Cross-modal representation; Feature matching; Metric learning

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

Video-Text Retrieval (VTR) aims to find the video most relevant to the semantics of a given sentence, and vice versa. In general, this retrieval task consists of four successive steps: video feature extraction, textual feature extraction, feature embedding and matching, and the design of objective functions. Finally, the samples retrieved from the dataset are ranked by their matching similarity to the query. In recent years, deep learning techniques have brought significant progress. However, VTR remains challenging because of problems such as how to learn efficient spatial-temporal video features and how to narrow the cross-modal gap. In this survey, we review and summarize over 100 research papers related to VTR, report state-of-the-art performance on several commonly used benchmark datasets, and discuss potential challenges and directions, with the aim of providing insights for researchers in the field of video-text retrieval.
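The final ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that the text query and each video have already been mapped into a shared embedding space, and it uses cosine similarity as the matching score; the abstract does not name a specific similarity measure, so that choice is an assumption. The function names `cosine_similarity` and `rank_videos`, the video identifiers, and the toy vectors are all hypothetical and are not taken from the surveyed paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(query_embedding, video_embeddings):
    """Return (video_id, score) pairs sorted from most to least similar to the query.

    query_embedding: list of floats for the text query (assumed precomputed).
    video_embeddings: dict mapping a video id to its embedding (assumed precomputed).
    """
    scored = [
        (video_id, cosine_similarity(query_embedding, embedding))
        for video_id, embedding in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical 2-dimensional embeddings, for illustration only.
toy_query = [1.0, 0.0]
toy_videos = {
    "video_a": [1.0, 0.0],
    "video_b": [0.0, 1.0],
    "video_c": [1.0, 1.0],
}

for video_id, score in rank_videos(toy_query, toy_videos):
    print(video_id, round(score, 3))
```

Video-to-text retrieval follows the same pattern with the roles swapped: the query is a video embedding and the candidates are sentence embeddings.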
