An Unsupervised Video Summarization Method Based on Multimodal Representation

Cited by: 0

Authors:

Lei, Zhuo [1,2]

Yu, Qiang [1]

Shou, Lidan [2]

Li, Shengquan [1]

Mao, Yunqing [1]

Affiliations:

[1] City Cloud Technology (China) Co., Ltd., Hangzhou, People's Republic of China

[2] Zhejiang University, Hangzhou, People's Republic of China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT V | 2023 / Vol. 14090

Keywords:

Video Summarization; Multi-modal Representation Learning; Unsupervised Learning

DOI:

10.1007/978-981-99-4761-4_15

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

A good video summary should convey the whole story and feature the most important content. However, the importance of video content is often subjective, so users should be able to personalize the summary by specifying in natural language what matters to them. Existing methods usually use only visual cues to solve generic video summarization tasks, whereas this work introduces a single unsupervised multi-modal framework that addresses both generic and query-focused video summarization. A multi-head attention model represents the multi-modal features, and a Transformer-based model learns frame scores using representative, diversity, and reconstruction losses. In particular, we develop a novel representative loss that trains the model on both visual and semantic information. The method outperforms previous state-of-the-art work on both generic and query-focused video summarization datasets.
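To make the idea of a diversity loss over frame scores concrete, here is a minimal sketch in plain Python. It is not the formulation used in the paper, which this record does not give. It assumes one common choice from the unsupervised summarization literature: the score-weighted mean pairwise cosine similarity between frame features, where a lower value means the highly scored frames are more diverse. The names `cosine` and `diversity_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def diversity_loss(features, scores):
    """Score-weighted mean pairwise cosine similarity between frames.

    Assumption (not from the paper): each pair (i, j) is weighted by
    scores[i] * scores[j], so only frames the model scores highly
    contribute. Lower values mean the selected frames are more diverse.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            w = scores[i] * scores[j]
            weighted_sum += w * cosine(features[i], features[j])
            weight_total += w
    return weighted_sum / weight_total if weight_total > 0.0 else 0.0


# Toy example: frames 0 and 1 are identical, frame 2 is orthogonal to both.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
redundant = diversity_loss(frames, [1.0, 1.0, 0.0])  # selects the two identical frames
diverse = diversity_loss(frames, [1.0, 0.0, 1.0])    # selects two orthogonal frames
```

In this toy case the redundant selection yields a higher loss than the diverse one, which is the behaviour a diversity term is meant to penalize. In a trainable model the same quantity would be computed on differentiable tensors rather than Python lists.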


Pages: 171-180

Page count: 10

References (31 in total):

[1] Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior [J].

Cai, Sijia; Zuo, Wangmeng; Davis, Larry S.; Zhang, Lei.

COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218: 193-210

[2] VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method [J].

Fontes de Avila, Sandra Eliza; Brandao Lopes, Ana Paula; da Luz, Antonio, Jr.; Araujo, Arnaldo de Albuquerque.

PATTERN RECOGNITION LETTERS, 2011, 32 (01): 56-68

[3] Gong B., 2014, Advances in Neural Information Processing Systems, P2069

[4] Creating Summaries from User Videos [J].

Gygli, Michael; Grabner, Helmut; Riemenschneider, Hayko; Van Gool, Luc.

COMPUTER VISION - ECCV 2014, PT VII, 2014, 8695: 505-520

[5] Unsupervised Video Summarization with Attentive Conditional Generative Adversarial Networks [J].

He, Xufeng; Hua, Yang; Song, Tao; Zhang, Zongpu; Xue, Zhengui; Ma, Ruhui; Robertson, Neil; Guan, Haibing.

PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019: 2296-2304

[6] Iashin V., 2020, BMVC 2020

[7] Jungin Park, 2020, Computer Vision - ECCV 2020. 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12370), P647, DOI 10.1007/978-3-030-58595-2_39

[8] Lee YJ, 2012, PROC CVPR IEEE, P1346, DOI 10.1109/CVPR.2012.6247820

[9] Lei J., 2021, NIPS 2021

[10] Lei Z., 2016, ACM MULT 2016 WORKSH, P45
