Bidirectional Long-Short Term Memory for Video Description

Cited by: 44
Authors
Bin, Yi [1 ]
Yang, Yang [1 ]
Shen, Fumin [1 ]
Xu, Xing [1 ]
Shen, Heng Tao [1 ,2 ]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
[2] Univ Queensland, Brisbane, Qld, Australia
Source
MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE | 2016
Keywords
Video captioning; bidirectional long-short term memory
DOI
10.1145/2964284.2967258
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches either ignore temporal information among video frames or exploit only local contextual temporal knowledge. In this work, we propose a novel video captioning framework, termed Bidirectional Long-Short Term Memory (BiLSTM), which deeply captures the bidirectional global temporal structure of a video. Specifically, we first devise a joint visual modelling approach that encodes video data by combining a forward LSTM pass and a backward LSTM pass with visual features from Convolutional Neural Networks (CNNs). We then inject the derived video representation into the subsequent language model for initialization. The benefits are twofold: 1) sequential and visual information is comprehensively preserved; and 2) dense visual features and sparse semantic representations are adaptively learned for videos and sentences, respectively. We verify the effectiveness of the proposed framework on a commonly used benchmark, the Microsoft Video Description (MSVD) corpus, and the experimental results demonstrate its superiority over several state-of-the-art methods.
Pages: 436-440
Page count: 5