Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization

Cited by: 1
Authors
Pang, Zongshang [1]
Nakashima, Yuta [1]
Otani, Mayu [2]
Nagahara, Hajime [1]
Affiliations
[1] Osaka Univ, Intelligence & Sensing Lab, Suita 5650871, Japan
[2] CyberAgent Inc, Tokyo 1500042, Japan
Keywords
video summarization; contrastive learning; visual pre-training
DOI
10.3390/jimaging10090229
Chinese Library Classification number
TB8 [Photographic technology]
Discipline classification code
0804
Abstract
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics that characterize a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics capture the diversity and representativeness of frames, two criteria commonly used for the unsupervised generation of video summaries, and that they achieve competitive or better performance than past methods even though no training is needed. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and thus improve performance on the public benchmarks TVSum and SumMe.
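The three keyframe criteria named in the abstract can be illustrated with a toy computation on per-frame feature vectors. This is only a rough sketch under stated assumptions: the abstract does not define the metrics, so the cosine-similarity formulas, the function names `frame_scores` and `uniqueness`, and the neighbor count `k` are all hypothetical choices made for illustration and are not the paper's definitions.

```python
import numpy as np


def frame_scores(features, k=2):
    """Toy local-dissimilarity and global-consistency scores.

    features: array of shape (num_frames, dim), e.g. pre-trained image features.
    k: number of most similar frames used as the local neighborhood
       (assumed; must be smaller than num_frames).
    """
    # L2-normalize so that dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    n = sim.shape[0]

    # Mask self-similarity before searching for neighbors.
    off_diag = sim.copy()
    np.fill_diagonal(off_diag, -np.inf)

    # Assumed local dissimilarity: 1 - mean similarity to the k nearest frames.
    top_k = np.sort(off_diag, axis=1)[:, -k:]
    local_dissimilarity = 1.0 - top_k.mean(axis=1)

    # Assumed global consistency: mean similarity to all other frames.
    global_consistency = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    return local_dissimilarity, global_consistency


def uniqueness(features, other_features):
    """Assumed uniqueness: 1 - mean similarity to frames of other videos."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    g = other_features / np.linalg.norm(other_features, axis=1, keepdims=True)
    return 1.0 - (f @ g.T).mean(axis=1)
```

In this sketch, a frame that differs from its nearest neighbors scores high on local dissimilarity, a frame that resembles the rest of the video scores high on global consistency, and a frame unlike those of other videos scores high on uniqueness. How such scores would be combined into a final importance value is left open here.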
Pages: 20