Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited: 41
Authors
Cai, Sijia [1 ,2 ]
Zuo, Wangmeng [3 ]
Davis, Larry S. [4 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
Source
COMPUTER VISION - ECCV 2018, PT XIV | 2018 / Vol. 11218
Keywords
Video summarization; Variational autoencoder
DOI
10.1007/978-3-030-01264-9_12
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Video summarization is a challenging under-constrained problem because the underlying summary of a single video strongly depends on users' subjective understandings. Data-driven approaches, such as deep neural networks, can deal with the ambiguity inherent in this task to some extent, but it is extremely expensive to acquire the temporal annotations of a large-scale video dataset. To leverage the plentiful web-crawled videos to improve the performance of video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and web data. Specifically, our framework couples two important components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of raw video and summary generation. A loss term to learn the semantic matching between the generated summaries and web videos is presented, and the overall framework is further formulated into a unified conditional variational encoder-decoder, called variational encoder-summarizer-decoder (VESD). Experiments conducted on the challenging datasets CoSum and TVSum demonstrate the superior performance of the proposed VESD over existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
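The abstract describes three coupled pieces: a variational encoder that maps video features to a Gaussian latent code, an attention module whose frame scores act as saliency for summary selection, and a decoder that reconstructs features from the latent code. The following is a minimal NumPy sketch of that structure only; the class name `ToyVESD`, all dimensions, and the top-k selection rule are illustrative assumptions, not the authors' implementation (see the linked repository for the real model).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyVESD:
    """Toy variational encoder-summarizer-decoder (hypothetical shapes/names)."""

    def __init__(self, feat_dim=64, latent_dim=16):
        self.W_mu = rng.standard_normal((feat_dim, latent_dim)) * 0.01
        self.W_logvar = rng.standard_normal((feat_dim, latent_dim)) * 0.01
        self.w_att = rng.standard_normal(feat_dim) * 0.01
        self.W_dec = rng.standard_normal((latent_dim, feat_dim)) * 0.01

    def encode(self, frames):
        # Variational encoder: pooled frame features -> Gaussian latent,
        # sampled with the standard reparameterization trick.
        pooled = frames.mean(axis=0)
        mu = pooled @ self.W_mu
        logvar = pooled @ self.W_logvar
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        return z, mu, logvar

    def summarize(self, frames, k):
        # Attention scores serve as frame-level saliency;
        # the k highest-scoring frames form the summary.
        scores = softmax(frames @ self.w_att)
        idx = np.argsort(scores)[::-1][:k]
        return np.sort(idx), scores

    def decode(self, z):
        # Decoder maps the latent code back to feature space.
        return z @ self.W_dec

frames = rng.standard_normal((120, 64))   # 120 frames of 64-d features
model = ToyVESD()
z, mu, logvar = model.encode(frames)
summary_idx, saliency = model.summarize(frames, k=10)
recon = model.decode(z)

# KL divergence of q(z|x) from the unit Gaussian prior (standard VAE term).
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
```

In the paper this selection is additionally trained against web videos via a semantic-matching loss; that supervision signal is omitted here, so the sketch only shows how saliency scoring, sampling, and reconstruction fit together.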
Pages: 193-210
Page count: 18