Hierarchical Memory Modelling for Video Captioning

Cited by: 12
Authors
Wang, Junbo [1 ,3 ]
Wang, Wei [1 ,3 ]
Huang, Yan [1 ,3 ]
Wang, Liang [1 ,2 ,3 ]
Tan, Tieniu [1 ,2 ,3 ]
Affiliations
[1] NLPR, Ctr Res Intelligent Percept & Comp CRIPAC, Beijing 100190, Peoples R China
[2] Chinese Acad Sci CASIA, Ctr Excellence Brain Sci & Intelligence Technol C, Inst Automat, Beijing, Peoples R China
[3] UCAS, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Funding
National Natural Science Foundation of China;
Keywords
Visual attention; hierarchical memory model; video captioning;
DOI
10.1145/3240508.3240538
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Translating videos into natural language sentences has drawn much attention recently. Frameworks combining visual attention with a Long Short-Term Memory (LSTM) based text decoder have made considerable progress. However, vision-to-language translation remains unsolved due to the semantic gap and misalignment between video content and the described semantic concepts. In this paper, we propose a Hierarchical Memory Model (HMM) - a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way. These memories can guide attention for efficient video representation extraction and semantic attribute selection, in addition to modelling the long-term dependencies of the video sequence and the sentence, respectively. Compared with a traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To demonstrate the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets: MSVD and MSR-VTT. Experiments show that our model not only discovers appropriate video representations and semantic attributes but also achieves performance comparable or superior to state-of-the-art methods on these datasets.
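The attention mechanism described in the abstract (memories guiding attention over video features during decoding) builds on standard temporal soft attention. The following is a minimal, framework-free sketch of that generic building block, not the paper's actual HMM implementation: the decoder's hidden state scores each frame feature (here by dot product, an assumed choice), the scores are softmax-normalized, and the weighted sum yields the context vector fed to the next decoding step.

```python
import math

def soft_attention(hidden, features):
    """Generic temporal soft attention over frame features.

    hidden   -- decoder hidden state, a list of floats
    features -- list of frame feature vectors (same dimension as hidden)
    Returns (weights, context): softmax attention weights over frames,
    and the attention-weighted sum of the frame features.
    """
    # Score each frame feature against the hidden state (dot product).
    scores = [sum(h * f for h, f in zip(hidden, feat)) for feat in features]
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of frame features.
    dim = len(features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, features))
               for i in range(dim)]
    return weights, context

# Toy example: the hidden state aligns with the first frame feature,
# so that frame receives the larger attention weight.
w, ctx = soft_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the paper's hierarchical design, separate memories would supply the keys for visual-feature attention and for semantic-attribute selection, but the normalize-and-pool pattern above is the shared core.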
Pages: 63-71
Page count: 9