Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning

Cited by: 158
Authors
Aafaq, Nayyer [1]
Akhtar, Naveed [1]
Liu, Wei [1]
Gilani, Syed Zulqarnain [1]
Mian, Ajmal [1]
Affiliations
[1] Univ Western Australia, Comp Sci & Software Engn, Nedlands, WA, Australia
Source
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019
DOI
10.1109/CVPR.2019.01277
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic generation of video captions is a fundamental challenge in computer vision. Recent techniques typically employ a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for video captioning. These methods mainly focus on tailoring sequence learning through RNNs for better caption generation, whereas off-the-shelf visual features are borrowed from CNNs. We argue that careful design of visual features for this task is equally important, and present a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs). Our method embeds rich temporal dynamics in visual features by hierarchically applying Short Fourier Transform to CNN features of the whole video. It additionally derives high-level semantics from an object detector to enrich the representation with spatial dynamics of the detected objects. The final representation is projected to a compact space and fed to a language model. By learning a relatively simple language model comprising two GRU layers, we establish a new state of the art on the MSVD and MSR-VTT datasets for the METEOR and ROUGE-L metrics.
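The temporal encoding step mentioned in the abstract (a Fourier transform applied hierarchically to per-frame CNN features) might be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `hierarchical_fourier_encode`, the halving scheme across levels, the use of magnitude spectra, and the `levels` and `n_coeffs` parameters are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def hierarchical_fourier_encode(features, levels=3, n_coeffs=4):
    """Encode a (T, D) array of per-frame features into one fixed-length vector.

    Assumed scheme (not taken from the paper): at level l the time axis is
    split into 2**l contiguous segments; for each segment the magnitudes of
    the first `n_coeffs` real-FFT coefficients along time are kept for every
    feature dimension, and everything is concatenated.
    """
    features = np.asarray(features, dtype=np.float64)
    if features.ndim != 2:
        raise ValueError("features must have shape (T, D)")
    if features.shape[0] < 2 ** (levels - 1):
        raise ValueError("too few frames for the requested number of levels")

    parts = []
    for level in range(levels):
        # np.array_split tolerates segment counts that do not divide T evenly.
        for segment in np.array_split(features, 2 ** level, axis=0):
            # Zero-pad short segments so at least n_coeffs coefficients exist.
            n_fft = max(segment.shape[0], 2 * n_coeffs)
            spectrum = np.abs(np.fft.rfft(segment, n=n_fft, axis=0))
            parts.append(spectrum[:n_coeffs].ravel())
    return np.concatenate(parts)
```

With these assumptions the output length is (2**levels - 1) * n_coeffs * D regardless of the number of frames T, which is what allows videos of different lengths to share one fixed-size representation before projection to a compact space.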
Pages: 12479-12488
Number of pages: 10