Stacked Memory Network for Video Summarization

Cited by: 36
Authors
Wang, Junbo [1 ,2 ]
Wang, Wei [1 ,2 ]
Wang, Zhiyong [3 ]
Wang, Liang [1 ,2 ]
Feng, Dagan [3 ]
Tan, Tieniu [1 ,2 ]
Affiliations
[1] Chinese Acad Sci CASIA, Inst Automat, NLPR, CRIPAC, Beijing, Peoples R China
[2] UCAS, Beijing, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW, Australia
Source
Proceedings of the 27th ACM International Conference on Multimedia (MM'19) | 2019
Funding
National Natural Science Foundation of China
Keywords
video summarization; stacked memory network; temporal dependency; recurrent neural network;
DOI
10.1145/3343031.3350992
Chinese Library Classification Number
TP39 [Computer Applications]
Discipline Classification Codes
081203; 0835
Abstract
In recent years, supervised video summarization has achieved promising progress with various recurrent neural network (RNN) based methods, which treat video summarization as a sequence-to-sequence learning problem to exploit temporal dependency among video frames across variable ranges. However, RNNs have limitations in modelling long-term temporal dependency when summarizing videos with thousands of frames, due to their restricted memory storage units. Therefore, in this paper we propose a stacked memory network, called SMN, to explicitly model long-term dependency among video frames so that redundancy in the produced video summaries can be minimized. Our proposed SMN consists of two key components: a Long Short-Term Memory (LSTM) layer and a memory layer, where each LSTM layer is augmented with an external memory layer. In particular, we stack multiple LSTM layers and memory layers hierarchically to integrate the representations learned by prior layers. By combining the hidden states of the LSTM layers with the read representations of the memory layers, our SMN is able to derive more accurate video summaries at the level of individual video frames. Compared with existing RNN based methods, our SMN is particularly good at capturing long-range temporal dependency among frames with few additional training parameters. Experimental results on two widely used public benchmark datasets, SumMe and TVSum, demonstrate that our proposed model clearly outperforms a number of state-of-the-art methods under various settings.
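The architecture described in the abstract (stacked LSTM layers, each augmented with an external memory whose read representation is combined with the hidden state before feeding the next layer, followed by per-frame scoring) can be illustrated with a minimal PyTorch sketch. The sketch below assumes a learned memory bank read by soft attention, concatenation as the combination step, and a sigmoid frame-importance head; the layer names, dimensions, and read mechanism are illustrative assumptions, not the authors' exact implementation.

# Minimal sketch (PyTorch) of a stacked LSTM + external-memory frame scorer.
# Memory size, attention read, and the scoring head are assumptions for
# illustration only, not the paper's exact architecture.
import torch
import torch.nn as nn


class MemoryAugmentedLSTMLayer(nn.Module):
    """One LSTM layer paired with a soft-attention read over an external memory bank."""

    def __init__(self, input_dim, hidden_dim, memory_slots=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Learned external memory bank: memory_slots x hidden_dim (an assumption).
        self.memory = nn.Parameter(torch.randn(memory_slots, hidden_dim) * 0.01)

    def forward(self, x):
        h, _ = self.lstm(x)                                 # (B, T, H) hidden states
        attn = torch.softmax(h @ self.memory.t(), dim=-1)   # (B, T, slots) read weights
        read = attn @ self.memory                           # (B, T, H) read representation
        # Combine hidden state and memory read; concatenation is an assumption.
        return torch.cat([h, read], dim=-1)


class StackedMemoryNetwork(nn.Module):
    """Stack of memory-augmented LSTM layers predicting per-frame importance scores."""

    def __init__(self, feature_dim=1024, hidden_dim=256, num_layers=2):
        super().__init__()
        layers, in_dim = [], feature_dim
        for _ in range(num_layers):
            layers.append(MemoryAugmentedLSTMLayer(in_dim, hidden_dim))
            in_dim = 2 * hidden_dim  # hidden state + memory read from the previous layer
        self.layers = nn.ModuleList(layers)
        self.scorer = nn.Linear(in_dim, 1)

    def forward(self, frame_features):                      # (B, T, feature_dim)
        x = frame_features
        for layer in self.layers:
            x = layer(x)
        return torch.sigmoid(self.scorer(x)).squeeze(-1)    # (B, T) frame importance


# Usage: score one video of 300 frames represented by 1024-d CNN features.
model = StackedMemoryNetwork()
scores = model(torch.randn(1, 300, 1024))                   # shape (1, 300)

In this reading, each layer passes both its hidden states and its memory reads upward, so deeper layers integrate what earlier layers stored, while the memory banks add only a small number of parameters compared with widening or deepening the LSTM itself.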
Pages: 836-844
Number of pages: 9