Modality correlation-based video summarization

Cited by: 0
Authors
Xingrun Wang
Xiushan Nie
Xingbo Liu
Binze Wang
Yilong Yin
Affiliations
[1] Shandong University, School of Computer Science and Technology
[2] Shandong Jianzhu University, School of Computer Science and Technology
[3] Chang’an University, College of Geology Engineering and Geomatics
[4] Shandong University, School of Software Engineering
Source
Multimedia Tools and Applications | 2020 / Vol. 79
Keywords
Video summarization; Modality correlation; Modality-specific information; Attention mechanism
DOI
Not available
Abstract
Video summarization, which extracts key frames or shots from an original video, is an important technique for browsing, storing, and retrieving the rapidly growing volume of video data. Because text information covers important content of a video, a summary can be generated by exploring the correlation between frames and text. In this study, we propose a video summarization method based on modality correlation. The method first learns the correlation between text and frames in each modality's respective space, and then fuses the two correlations to obtain an importance score for each shot. Finally, the shots with high importance scores are selected to form the video summary. Previous methods seldom use text to generate video summaries, or use only the latent common information shared by text and frames; in contrast, the proposed method exploits both the latent common information and the modality-specific information. Experiments conducted on the TVSum50 dataset verify the effectiveness of the proposed approach.
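The pipeline described in the abstract (correlate text and frames in each space, fuse the two correlations, score shots, keep the high-scoring ones) can be illustrated with a toy NumPy sketch. Everything below is an assumption for illustration, not the authors' model: the hypothetical matrices `w_frame_to_text` and `w_text_to_frame` are random stand-ins for learned cross-modal mappings, cosine similarity with a max over text items stands in for the learned correlation, fusion is a weighted average with a hypothetical weight `alpha`, and shots are picked greedily under a hypothetical length `budget`. The attention mechanism listed among the keywords is not modeled.

```python
import numpy as np


def _l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Scale each row to (near) unit length so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def frame_scores(frames, texts, w_frame_to_text, w_text_to_frame, alpha=0.5):
    # Hypothetical correlation in the text space: project frames, compare with text.
    in_text_space = _l2_normalize(frames @ w_frame_to_text) @ _l2_normalize(texts).T
    # Hypothetical correlation in the frame space: project text, compare with frames.
    in_frame_space = _l2_normalize(frames) @ _l2_normalize(texts @ w_text_to_frame).T
    # Each frame keeps its strongest match to any text item in each space;
    # the two per-space correlations are then fused by a weighted average.
    return alpha * in_text_space.max(axis=1) + (1 - alpha) * in_frame_space.max(axis=1)


def select_shots(scores, shot_bounds, budget=0.15):
    # Shot importance = mean fused score of the frames in [start, end).
    shot_scores = [scores[s:e].mean() for s, e in shot_bounds]
    limit = budget * len(scores)
    chosen, used = [], 0
    # Greedily take the highest-scoring shots that still fit in the length budget.
    for i in np.argsort(shot_scores)[::-1]:
        start, end = shot_bounds[i]
        if used + (end - start) <= limit:
            chosen.append(int(i))
            used += end - start
    return sorted(chosen)


# Toy usage with random features: 100 frames, 5 text items, 10 shots of 10 frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))
texts = rng.normal(size=(5, 32))
w_ft = rng.normal(size=(64, 32))
w_tf = rng.normal(size=(32, 64))
scores = frame_scores(frames, texts, w_ft, w_tf)
bounds = [(i, i + 10) for i in range(0, 100, 10)]
chosen = select_shots(scores, bounds, budget=0.2)
print(chosen)
```

With a 20% budget over 100 frames, at most 20 frames fit, so the greedy loop keeps two 10-frame shots. Greedy selection is only a simple stand-in; a knapsack-style selection is another common way to respect a length budget.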
Pages: 33875–33890
Number of pages: 15