Hierarchical Memory Modelling for Video Captioning

Cited by: 12
Authors
Wang, Junbo [1 ,3 ]
Wang, Wei [1 ,3 ]
Huang, Yan [1 ,3 ]
Wang, Liang [1 ,2 ,3 ]
Tan, Tieniu [1 ,2 ,3 ]
Affiliations
[1] NLPR, Ctr Res Intelligent Percept & Comp CRIPAC, Beijing 100190, Peoples R China
[2] Chinese Acad Sci CASIA, Ctr Excellence Brain Sci & Intelligence Technol C, Inst Automat, Beijing, Peoples R China
[3] UCAS, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Funding
National Natural Science Foundation of China;
Keywords
Visual attention; hierarchical memory model; video captioning;
DOI
10.1145/3240508.3240538
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
Translating videos into natural language sentences has drawn much attention recently. Frameworks combining visual attention with a Long Short-Term Memory (LSTM) based text decoder have made considerable progress. However, vision-to-language translation remains unsolved due to the semantic gap and misalignment between video content and the described semantic concepts. In this paper, we propose a Hierarchical Memory Model (HMM) - a novel deep video captioning architecture which unifies a textual memory, a visual memory and an attribute memory in a hierarchical way. These memories can guide attention for efficient video representation extraction and semantic attribute selection, in addition to modelling the long-term dependencies of the video sequence and the sentence, respectively. Compared with a traditional vision-based text decoder, the proposed attribute-based text decoder can largely reduce the semantic discrepancy between video and sentence. To demonstrate the effectiveness of the proposed model, we perform extensive experiments on two public benchmark datasets: MSVD and MSR-VTT. Experiments show that our model not only discovers appropriate video representations and semantic attributes but also achieves performance comparable or superior to state-of-the-art methods on these datasets.
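The attention mechanism described in the abstract (memories guiding attention over video features during decoding) builds on standard temporal soft attention. The following is a minimal, framework-free sketch of that generic building block, not the paper's actual HMM implementation: the decoder's hidden state scores each frame feature (here by dot product, an assumed choice), the scores are softmax-normalized, and the weighted sum yields the context vector fed to the next decoding step.

```python
import math

def soft_attention(hidden, features):
    """Generic temporal soft attention over frame features.

    hidden   -- decoder hidden state, a list of floats
    features -- list of frame feature vectors (same dimension as hidden)
    Returns (weights, context): softmax attention weights over frames,
    and the attention-weighted sum of the frame features.
    """
    # Score each frame feature against the hidden state (dot product).
    scores = [sum(h * f for h, f in zip(hidden, feat)) for feat in features]
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of frame features.
    dim = len(features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, features))
               for i in range(dim)]
    return weights, context

# Toy example: the hidden state aligns with the first frame feature,
# so that frame receives the larger attention weight.
w, ctx = soft_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the paper's hierarchical design, separate memories would supply the keys for visual-feature attention and for semantic-attribute selection, but the normalize-and-pool pattern above is the shared core.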
Pages: 63-71
Page count: 9