Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited by: 45
Authors
Cai, Sijia [1 ,2 ]
Zuo, Wangmeng [3 ]
Davis, Larry S. [4 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
Source
COMPUTER VISION - ECCV 2018, PT XIV | 2018 / Vol. 11218
Keywords
Video summarization; Variational autoencoder
DOI
10.1007/978-3-030-01264-9_12
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Video summarization is a challenging, under-constrained problem because the underlying summary of a single video depends strongly on users' subjective understanding. Data-driven approaches such as deep neural networks can cope with the ambiguity inherent in this task to some extent, but acquiring temporal annotations for a large-scale video dataset is extremely expensive. To leverage plentiful web-crawled videos for improving video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and the web data. Specifically, our framework couples two components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of the raw video and summary generation. We introduce a loss term that learns the semantic matching between the generated summaries and web videos, and formulate the overall framework as a unified conditional variational encoder-decoder, called the variational encoder-summarizer-decoder (VESD). Experiments on the challenging CoSum and TVSum datasets demonstrate the superior performance of the proposed VESD over existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
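As a rough illustration of the two coupled components described above (this is not the authors' implementation; all shapes, function names, and the toy data below are hypothetical), the core mechanics can be sketched as a VAE-style reparameterized latent sampler plus an attention-based saliency scorer that ranks frames for the summary:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def attention_saliency(frame_feats, query):
    """Score each frame against a latent semantic query via softmax attention.

    frame_feats: (T, d) per-frame features of the raw video
    query:       (d,)   latent semantic vector from the VAE branch
    returns:     (T,)   saliency weights summing to 1
    """
    scores = frame_feats @ query / np.sqrt(frame_feats.shape[1])
    scores -= scores.max()            # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def summarize(weights, k):
    """Select the k most salient frame indices, in temporal order."""
    keep = np.argsort(weights)[::-1][:k]
    return np.sort(keep)

# Toy example: a "video" of 8 frames with 4-d features
T, d = 8, 4
frames = rng.standard_normal((T, d))
mu, log_var = frames.mean(axis=0), np.zeros(d)   # stand-in encoder outputs
z = reparameterize(mu, log_var, rng)             # latent semantics
w = attention_saliency(frames, z)                # per-frame saliency
summary = summarize(w, k=3)                      # 3-frame summary
```

In the actual VESD framework the encoder outputs `mu` and `log_var` are produced by networks trained on web videos, and the summary is scored against web videos through the semantic-matching loss; the sketch only shows the sampling-and-attention pattern.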
Pages: 193-210
Page count: 18