Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited by: 45
Authors
Cai, Sijia [1 ,2 ]
Zuo, Wangmeng [3 ]
Davis, Larry S. [4 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
Source
COMPUTER VISION - ECCV 2018, PT XIV | 2018 / Vol. 11218
Keywords
Video summarization; Variational autoencoder
DOI
10.1007/978-3-030-01264-9_12
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video summarization is a challenging, under-constrained problem because the underlying summary of a single video depends strongly on users' subjective understanding. Data-driven approaches such as deep neural networks can cope with the ambiguity inherent in this task to some extent, but acquiring temporal annotations for a large-scale video dataset is extremely expensive. To leverage plentiful web-crawled videos for improving video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and the web data. Specifically, our framework couples two components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of the raw video and summary generation. We introduce a loss term that learns the semantic matching between the generated summaries and web videos, and formulate the overall framework as a unified conditional variational encoder-decoder, called the variational encoder-summarizer-decoder (VESD). Experiments on the challenging CoSum and TVSum datasets demonstrate the superior performance of the proposed VESD over existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
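As a rough illustration of the two coupled components described above (this is not the authors' implementation; all shapes, function names, and the toy data below are hypothetical), the core mechanics can be sketched as a VAE-style reparameterized latent sampler plus an attention-based saliency scorer that ranks frames for the summary:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def attention_saliency(frame_feats, query):
    """Score each frame against a latent semantic query via softmax attention.

    frame_feats: (T, d) per-frame features of the raw video
    query:       (d,)   latent semantic vector from the VAE branch
    returns:     (T,)   saliency weights summing to 1
    """
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])
    scores -= scores.max()            # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def summarize(weights, k):
    """Select the k most salient frame indices, in temporal order."""
    keep = np.argsort(weights)[::-1][:k]
    return np.sort(keep)

# Toy example: a "video" of 8 frames with 4-d features
T, d = 8, 4
frames = rng.standard_normal((T, d))
mu, log_var = frames.mean(axis=0), np.zeros(d)   # stand-in encoder outputs
z = reparameterize(mu, log_var, rng)             # latent semantics
w = attention_saliency(frames, z)                # per-frame saliency
summary = summarize(w, k=3)                      # 3-frame summary
```

In the actual VESD framework the encoder outputs `mu` and `log_var` are produced by networks trained on web videos, and the summary is scored against web videos through the semantic-matching loss; the sketch only shows the sampling-and-attention pattern.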
Pages: 193-210
Page count: 18