Text-video retrieval method based on enhanced self-attention and multi-task learning

Cited by: 0
Authors
Xiaoyu Wu
Jiayao Qian
Tiantian Wang
Affiliation
[1] Communication University of China, State Key Laboratory of Media Convergence and Communication
Source
Multimedia Tools and Applications | 2023 / Vol. 82
Keywords
Text-video retrieval; Self-attention; Multi-task learning; Semantic space
DOI
Not available
Abstract
The explosive growth of video on the Internet makes it challenging to retrieve the videos we need from text queries. Text-video retrieval methods generally project both modalities into a common semantic space and compute a similarity score there. The key problems for a retrieval model are obtaining strong feature representations of text and video and bridging the semantic gap between the two modalities. Moreover, most existing methods do not consider the strong consistency of positive text-video pairs. To address these problems, we propose a text-video retrieval method based on enhanced self-attention and multi-task learning. First, during encoding, the extracted text and video feature vectors are fed into a Transformer based on an enhanced self-attention mechanism for encoding and fusion. The resulting text and video representations are then projected into a common semantic space. Finally, by introducing multi-task learning in the common semantic space, our approach combines a semantic similarity measurement task with a semantic consistency judgement task, optimizing the common space through semantic consistency constraints. Our method achieves better retrieval performance than some existing approaches on the MSR-Video to Text (MSRVTT), Large Scale Movie Description Challenge (LSMDC), and ActivityNet datasets, which demonstrates the effectiveness of the proposed strategies.
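The last two steps of the abstract (scoring in a common semantic space, then training with two tasks at once) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss functions, so the max-margin ranking loss, the sigmoid-based consistency classifier, the task weight, and every function name below are assumptions chosen only for illustration.

```python
import math
import random


def linear(x, weights):
    """Project vector x with a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(texts, videos):
    """sim[i][j] is the similarity of text i and video j in the common space."""
    return [[cosine(t, v) for v in videos] for t in texts]


def ranking_loss(sim, margin=0.2):
    """Assumed similarity task: bidirectional max-margin ranking loss.

    Pair (i, i) is treated as the positive pair, all others as negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += max(0.0, margin + sim[i][j] - sim[i][i])  # text to video
                total += max(0.0, margin + sim[j][i] - sim[i][i])  # video to text
    return total / n


def consistency_loss(sim):
    """Assumed consistency task: binary cross-entropy on matched vs. unmatched.

    The 'classifier' here is just a sigmoid of the similarity (hypothetical).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = 1.0 / (1.0 + math.exp(-sim[i][j]))
            label = 1.0 if i == j else 0.0
            total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / (n * n)


def multi_task_loss(sim, weight=0.5):
    """Weighted sum of the two task losses; the weight is an assumption."""
    return ranking_loss(sim) + weight * consistency_loss(sim)


if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins for encoder outputs: 3 texts (dim 4) and 3 videos (dim 6).
    texts = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    videos = [[random.gauss(0, 1) for _ in range(6)] for _ in range(3)]
    # Separate random projections map both modalities into a 5-dim common space.
    w_text = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
    w_video = [[random.gauss(0, 1) for _ in range(6)] for _ in range(5)]
    sim = similarity_matrix([linear(t, w_text) for t in texts],
                            [linear(v, w_video) for v in videos])
    print(multi_task_loss(sim))
```

In a real system the projections would be learned by minimizing the combined loss, so that matched text-video pairs score higher than mismatched ones while the consistency term further separates the two groups.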
Pages: 24387-24406
Page count: 19