Text-video retrieval method based on enhanced self-attention and multi-task learning

Cited by: 0

Authors:

Xiaoyu Wu

Jiayao Qian

Tiantian Wang

Affiliation:

[1] Communication University of China, State Key Laboratory of Media Convergence and Communication

Source:

Multimedia Tools and Applications | 2023 / Vol. 82

Keywords:

Text-video retrieval; Self-attention; Multi-task learning; Semantic space

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

The explosive growth of video on the Internet makes retrieving the videos we need from text queries a major challenge. The general approach to text-video retrieval is to project text and video into a common semantic space and compute a similarity score there. The key problems for a retrieval model are how to obtain strong feature representations of text and video and how to bridge the semantic gap between the two modalities. Moreover, most existing methods do not consider the strong consistency of positive text-video sample pairs. To address these problems, we propose a text-video retrieval method based on enhanced self-attention and multi-task learning. First, during encoding, the extracted text and video feature vectors are fed into a Transformer based on an enhanced self-attention mechanism for encoding and fusion. Then, the text and video representations are projected into a common semantic space. Finally, by introducing multi-task learning in the common semantic space, our approach combines the semantic similarity measurement task with the semantic consistency judgement task, optimizing the common space through semantic consistency constraints. Our method achieves better retrieval performance than some existing approaches on the MSR-Video to Text (MSRVTT), Large Scale Movie Description Challenge (LSMDC), and ActivityNet datasets, which demonstrates the effectiveness of the proposed strategies.
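
The abstract gives no loss formulas, so the following is only an illustrative sketch of scoring text-video pairs in a common space and training with two tasks. It is not the authors' implementation. It assumes cosine similarity as the score, a bidirectional max-margin ranking loss for the similarity measurement task, and a binary cross-entropy over "matching / not matching" pairs for the consistency judgement task. All function names, the margin, the scale, and the task weight are hypothetical, and the embeddings are taken as already projected into the common space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(text_embs, video_embs):
    """sim[i][j] = similarity of text i and video j in the common space."""
    return [[cosine(t, v) for v in video_embs] for t in text_embs]

def ranking_loss(sim, margin=0.2):
    """Assumed similarity task: bidirectional max-margin ranking loss.
    Pair (i, i) is the positive; every (i, j) with j != i is a negative.
    Requires at least two pairs."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Text i should score higher with video i than with video j.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Video i should score higher with text i than with text j.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / (2 * n * (n - 1))

def consistency_loss(sim, scale=5.0):
    """Assumed consistency task: binary cross-entropy on whether
    (text i, video j) is a matching pair (label 1 only when i == j)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = 1.0 / (1.0 + math.exp(-scale * sim[i][j]))
            label = 1.0 if i == j else 0.0
            total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / (n * n)

def multi_task_loss(text_embs, video_embs, weight=0.5):
    """Weighted sum of the two task losses (the weight is hypothetical)."""
    sim = similarity_matrix(text_embs, video_embs)
    return ranking_loss(sim) + weight * consistency_loss(sim)

def retrieve(query_emb, video_embs):
    """Return video indices sorted by descending similarity to the query."""
    scores = [cosine(query_emb, v) for v in video_embs]
    return sorted(range(len(video_embs)), key=lambda j: scores[j], reverse=True)

# Toy embeddings: text i is meant to match video i.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
video_embs = [[0.9, 0.1], [0.1, 0.9]]
ranked = retrieve(text_embs[0], video_embs)
loss = multi_task_loss(text_embs, video_embs)
```

In this toy setup the combined loss is lower when matching pairs are aligned than when the videos are swapped, which is the behaviour the two tasks are meant to encourage. A real system would compute these losses on learned encoder outputs with an automatic-differentiation framework.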

Pages: 24387-24406

Page count: 19
