Generating Video Descriptions with Topic Guidance

Cited by: 11
Authors
Chen, Shizhe [1 ]
Chen, Jia [2 ]
Jin, Qin [1 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Multimedia Comp Lab, Beijing, Peoples R China
[2] Carnegie Mellon Univ, Language Technol Inst, Sch Comp Sci, Pittsburgh, PA 15213 USA
Source
Proceedings of the 2017 ACM International Conference on Multimedia Retrieval (ICMR'17) | 2017
Keywords
Video Captioning; Data-driven Topics; Multi-modalities; Teacher-student Learning;
DOI
10.1145/3078971.3079000
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Generating natural-language descriptions of videos (a.k.a. video captioning) is more challenging than image captioning because videos are intrinsically more complex than images in two respects. First, videos cover a broader range of topics, such as news, music, and sports. Second, multiple topics can coexist in the same video. In this paper, we propose a novel captioning model, the topic-guided model (TGM), which exploits topic information to generate topic-oriented descriptions for videos in the wild. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions with an unsupervised topic mining model, and we show that these data-driven topics reflect a better topic schema than the predefined ones. To predict topics for test videos, we treat the topic mining model as a teacher and train a student topic prediction model that exploits all modalities of the video, especially the speech modality. We further propose a series of caption models that exploit topic guidance: implicitly, by using topics as input features to encourage topic-related words, and explicitly, by modulating the decoder weights with topics so that the decoder functions as an ensemble of topic-aware language decoders. Comprehensive experiments on MSR-VTT, currently the largest video captioning dataset, demonstrate the effectiveness of the topic-guided model, which significantly surpasses the winning performance of the 2016 MSR Video to Language Challenge.
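As an illustrative aid only (not the authors' released code), the sketch below shows, in PyTorch, the two forms of topic guidance the abstract describes: feeding the predicted topic distribution as an extra input feature, and mixing topic-specific decoder weight matrices so the decoder behaves like a soft ensemble of topic-aware language decoders. All layer names, dimensions, and the simplified (non-GRU) cell design are assumptions made for brevity.

import torch
import torch.nn as nn

class TopicGuidedDecoderCell(nn.Module):
    # Hypothetical topic-guided decoder cell: sizes and layer names are illustrative only.
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_topics=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Implicit guidance: the topic distribution is projected and concatenated to the word input.
        self.topic_proj = nn.Linear(num_topics, embed_dim)
        # Explicit guidance: one input-to-hidden weight matrix per topic; the topic
        # distribution mixes them, giving a soft ensemble of topic-aware decoders.
        self.topic_weights = nn.Parameter(
            torch.randn(num_topics, 2 * embed_dim, hidden_dim) * 0.01
        )
        self.hidden_proj = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, hidden, topic_dist):
        # word_ids: (batch,), hidden: (batch, hidden_dim), topic_dist: (batch, num_topics)
        x = torch.cat([self.embed(word_ids), self.topic_proj(topic_dist)], dim=-1)
        # Mix the per-topic weight matrices with the topic distribution.
        w = torch.einsum("bk,kij->bij", topic_dist, self.topic_weights)
        hidden = torch.tanh(torch.einsum("bi,bij->bj", x, w) + self.hidden_proj(hidden))
        return self.out(hidden), hidden

Under this reading, the topic distribution would come from the data-driven topic mining model (the teacher) during training and from the multi-modal topic prediction model (the student) at test time, consistent with the teacher-student scheme described in the abstract.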
Pages: 5-13
Page count: 9