Attention-Based Bidirectional Recurrent Neural Networks for Description Generation of Videos

Cited by: 0
Authors
Du, Xiaotong [1]
Yuan, Jiabin [1]
Liu, Hu [1]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
Source
CLOUD COMPUTING AND SECURITY, PT VI | 2018 / Vol. 11068
Keywords
Video description; Convolutional Neural Networks; Bidirectional Recurrent Neural Networks; Attention mechanism;
DOI
10.1007/978-3-030-00021-9_40
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Describing videos in natural language is important in many applications, such as managing large volumes of online video and providing descriptive video service (DVS) for blind people. To improve on existing video description frameworks, this paper presents an end-to-end deep learning model that combines Convolutional Neural Networks (CNNs) and Bidirectional Recurrent Neural Networks (BiRNNs) with a multimodal attention mechanism. First, the model produces richer video representations than comparable approaches, covering image, motion, and audio features. Second, a BiRNN encodes these features in both the forward and backward directions. Finally, an attention-based decoder translates the encoder's sequential outputs into a sequence of words. The model is evaluated on the Microsoft Research Video Description Corpus (MSVD) dataset. The results demonstrate the necessity of combining BiRNNs with a multimodal attention mechanism and the superiority of this model over other state-of-the-art methods evaluated on this dataset.
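The encode-then-attend pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the plain tanh recurrence, the dot-product scoring function, the single concatenated feature stream, the random placeholder features and weights, and all names (`rnn_pass`, `birnn_encode`, `attend`) are assumptions made for illustration. The paper's multimodal attention and its recurrent cells may differ.

```python
import math
import random


def rnn_pass(inputs, w_in, w_rec):
    """Run a simple tanh RNN over a sequence; return one hidden vector per step."""
    hidden = [0.0] * len(w_rec)
    outputs = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        outputs.append(hidden)
    return outputs


def birnn_encode(inputs, fwd_weights, bwd_weights):
    """Encode the sequence in both directions and concatenate the states per step."""
    forward = rnn_pass(inputs, *fwd_weights)
    # Process the reversed sequence, then flip back so steps line up.
    backward = rnn_pass(inputs[::-1], *bwd_weights)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def attend(query, encoded):
    """Dot-product soft attention (assumed scoring): weights and context vector."""
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encoded]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoded))
        for k in range(len(encoded[0]))
    ]
    return weights, context


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM, NUM_STEPS = 6, 4, 5  # hypothetical sizes

# Placeholder for per-step concatenated image, motion, and audio features.
features = [[rng.uniform(-1, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_STEPS)]

fwd_weights = (random_matrix(HIDDEN_DIM, FEATURE_DIM, rng),
               random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng))
bwd_weights = (random_matrix(HIDDEN_DIM, FEATURE_DIM, rng),
               random_matrix(HIDDEN_DIM, HIDDEN_DIM, rng))

encoded = birnn_encode(features, fwd_weights, bwd_weights)

# Stand-in for the decoder state at one output step.
decoder_state = [rng.uniform(-1, 1) for _ in range(2 * HIDDEN_DIM)]
weights, context = attend(decoder_state, encoded)
```

In this sketch each encoder output has length `2 * HIDDEN_DIM` because the forward and backward states are concatenated, and `weights` assigns one non-negative weight per time step that a decoder could use to focus on different parts of the video when emitting each word.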
Pages: 440-451
Page count: 12