Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization

Cited by: 1

Authors:

Pang, Zongshang [1]

Nakashima, Yuta [1]

Otani, Mayu [2]

Nagahara, Hajime [1]

Affiliations:

[1] Osaka Univ, Intelligence & Sensing Lab, Suita 5650871, Japan

[2] CyberAgent Inc, Tokyo 1500042, Japan

Source:

JOURNAL OF IMAGING | 2024 / Vol. 10 / Issue 09

Keywords:

video summarization; contrastive learning; visual pre-training

DOI:

10.3390/jimaging10090229

Chinese Library Classification (CLC) number:

TB8 [Photographic technology]

Discipline classification code:

0804

Abstract:

Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics that characterize a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics capture well the diversity and representativeness of frames, two criteria commonly used for the unsupervised generation of video summaries, and that they achieve competitive or better performance than past methods while requiring no training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and thus improve performance on the public benchmarks TVSum and SumMe.

Pages: 20
