Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization

Cited by: 0
Authors
Pang, Zongshang [1]
Nakashima, Yuta [1]
Otani, Mayu [2]
Nagahara, Hajime [1]
Affiliations
[1] Osaka Univ, Intelligence & Sensing Lab, Suita 5650871, Japan
[2] CyberAgent Inc, Tokyo 1500042, Japan
Keywords
video summarization; contrastive learning; visual pre-training
DOI
10.3390/jimaging10090229
Chinese Library Classification (CLC) number
TB8 [Photographic technology]
Discipline classification code
0804
Abstract
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics that characterize a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics capture the diversity and representativeness of frames, two criteria commonly used for the unsupervised generation of video summaries, and that they deliver competitive or better performance than past methods without any training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and thus improve performance on the public benchmarks TVSum and SumMe.
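The abstract names the three metrics but does not give their formulas, so the following is only an illustrative sketch of how per-frame scores of this kind could be computed from pre-extracted frame features. The function name `keyframe_scores`, the neighbour count `k`, and the specific cosine-similarity definitions are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def keyframe_scores(features, k=2):
    """Toy per-frame scores from a (T, D) array of frame features.

    All three definitions are illustrative assumptions, not the
    formulas used in the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    x = features / (norms + 1e-12)
    sim = x @ x.T
    num_frames = sim.shape[0]

    # Local dissimilarity (assumed): one minus the mean similarity
    # to the k most similar other frames.
    off_diag = sim.copy()
    np.fill_diagonal(off_diag, -np.inf)
    nearest = np.sort(off_diag, axis=1)[:, -k:]
    local_dissimilarity = 1.0 - nearest.mean(axis=1)

    # Global consistency (assumed): similarity to the normalised
    # mean feature of the whole video.
    center = x.mean(axis=0)
    center = center / (np.linalg.norm(center) + 1e-12)
    global_consistency = x @ center

    # Uniqueness (assumed): one minus the mean similarity to all
    # other frames.
    mean_other = (sim.sum(axis=1) - np.diag(sim)) / (num_frames - 1)
    uniqueness = 1.0 - mean_other

    return {
        "local_dissimilarity": local_dissimilarity,
        "global_consistency": global_consistency,
        "uniqueness": uniqueness,
    }


if __name__ == "__main__":
    # Two identical frames and one orthogonal frame.
    toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    for name, values in keyframe_scores(toy, k=1).items():
        print(name, values)
```

In this toy input the third frame differs from the other two, so under these assumed definitions it scores highest on uniqueness and local dissimilarity and lowest on global consistency. How the three scores would be combined into a single importance value is not stated in the abstract and is left open here.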
Pages: 20