Enhancing the alignment between target words and corresponding frames for video captioning

Cited by: 43
Authors
Tu, Yunbin [1 ,2 ]
Zhou, Chang [3 ]
Guo, Junjun [1 ,2 ]
Gao, Shengxiang [1 ,2 ]
Yu, Zhengtao [1 ,2 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Yunnan, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Yunnan, Peoples R China
[3] Tsinghua Shenzhen Int Grad Sch, Dept Informat Sci & Technol, Shenzhen 518000, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Alignment; Visual tags; Textual-temporal attention; LANGUAGE; VISION;
DOI
10.1016/j.patcog.2020.107702
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning aims to translate a sequence of video frames into a sequence of words within the encoder-decoder framework, so aligning these two different sequences is critical. Most existing methods exploit a soft-attention (temporal attention) mechanism to align target words with corresponding frames, where their relevance depends only on the previously generated words (i.e., the language context). However, there is an inherent gap between vision and language, and most words in a caption are non-visual words (e.g., "a", "is", and "in"). Hence, guided by the language context alone, existing temporal attention-based methods cannot exactly align target words with corresponding frames. To address this problem, we first introduce visual tags pre-detected from the video to bridge the gap between vision and language: visual tags belong to the textual modality yet also convey visual information. We then present a Textual-Temporal Attention model (TTA) to exactly align target words with corresponding frames. Experimental results show that our proposed method outperforms state-of-the-art methods on two well-known datasets, MSVD and MSR-VTT. (C) 2020 Elsevier Ltd. All rights reserved.
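The two attention streams described in the abstract can be sketched in a few lines of NumPy. This is a minimal, illustrative reconstruction, not the paper's exact TTA formulation: the additive (Bahdanau-style) scoring function, the parameter names (`W_f`, `W_h`, `v`), and the scalar sigmoid gate used to fuse the visual and textual contexts are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(feats, hidden, W_f, W_h, v):
    """Additive (Bahdanau-style) attention: score each item in `feats`
    against the decoder state `hidden`, then return the weighted context."""
    scores = np.tanh(feats @ W_f + hidden @ W_h) @ v   # (N,) unnormalized scores
    alpha = softmax(scores)                            # alignment weights, sum to 1
    return alpha @ feats, alpha                        # context (d,), weights (N,)

# Toy dimensions: 5 frames, 4 tags, feature dim 8, hidden dim 6, attention dim 10.
T, K, d, d_h, d_a = 5, 4, 8, 6, 10
frames = rng.standard_normal((T, d))   # per-frame visual features
tags   = rng.standard_normal((K, d))   # embeddings of pre-detected visual tags
h      = rng.standard_normal(d_h)      # decoder hidden state (language context)

def params():
    # Fresh random projection matrices for one attention head (illustrative).
    return (rng.standard_normal((d, d_a)),
            rng.standard_normal((d_h, d_a)),
            rng.standard_normal(d_a))

ctx_v, alpha_v = attend(frames, h, *params())  # temporal attention over frames
ctx_t, alpha_t = attend(tags,   h, *params())  # textual attention over tags

# Hypothetical gate fusion: mix visual and textual context by a scalar in (0, 1).
g = 1.0 / (1.0 + np.exp(-h.sum()))
fused = g * ctx_v + (1.0 - g) * ctx_t          # context fed to the word decoder
```

The point of the textual stream is visible in `alpha_t`: when the decoder is about to emit a non-visual word, the model can lean on the tag context rather than forcing a spurious alignment to some frame.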
Pages: 11