Learning Pixel-Level Distinctions for Video Highlight Detection

Cited by: 10
Authors
Wei, Fanyue [1,2,3]
Wang, Biao [3]
Ge, Tiezheng [3]
Jiang, Yuning [3]
Li, Wen [1,2]
Duan, Lixin [1,2]
Affiliations
[1] UESTC, Sch Comp Sci & Engn, Chengdu, Sichuan, Peoples R China
[2] UESTC, Shenzhen Inst Adv Study, Chengdu, Sichuan, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
RANKING;
DOI
10.1109/CVPR52688.2022.00308
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
The goal of video highlight detection is to select the most attractive segments from a long video so as to depict its most interesting parts. Existing methods typically focus on modeling the relationships between different video segments in order to learn a model that can assign highlight scores to these segments; however, these approaches do not explicitly consider the contextual dependency within individual segments. To this end, we propose to learn pixel-level distinctions to improve video highlight detection. Such a pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section. The advantages of modeling this fine-level distinction are two-fold. First, it allows us to exploit the temporal and spatial relations of the content in one video, since the distinction of a pixel in one frame depends heavily both on the content before that frame and on the content around the pixel within the frame. Second, learning the pixel-level distinction also gives a good explanation of the video highlight task regarding what content in a highlight segment is attractive to people. We design an encoder-decoder network to estimate the pixel-level distinction, in which we leverage 3D convolutional neural networks to exploit temporal context information and further take advantage of visual saliency to model the spatial distinction. State-of-the-art performance on three public benchmarks clearly validates the effectiveness of our framework for video highlight detection.
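The abstract describes the method only at a high level. Below is a minimal PyTorch sketch of the general idea it names: a 3D-convolutional encoder-decoder that predicts a per-pixel "distinction" map for a video clip and pools it into a segment-level highlight score. This is an illustrative assumption, not the authors' implementation; the class name PixelDistinctionNet, all layer sizes, and the mean-pooling readout are invented for the example, and the paper's visual-saliency component is omitted.

    # Illustrative sketch only; layer sizes and names are assumptions,
    # not the paper's released architecture.
    import torch
    import torch.nn as nn

    class PixelDistinctionNet(nn.Module):
        def __init__(self, in_channels: int = 3):
            super().__init__()
            # Encoder: 3D convolutions aggregate temporal context across frames
            # while downsampling spatially (stride 2 in H/W, stride 1 in time).
            self.encoder = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
            )
            # Decoder: transposed 3D convolutions restore the input resolution
            # so that every pixel of every frame receives a distinction value.
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(64, 32, kernel_size=(1, 4, 4),
                                   stride=(1, 2, 2), padding=(0, 1, 1)),
                nn.ReLU(inplace=True),
                nn.ConvTranspose3d(32, 1, kernel_size=(1, 4, 4),
                                   stride=(1, 2, 2), padding=(0, 1, 1)),
            )

        def forward(self, clip: torch.Tensor):
            # clip: (batch, channels, time, height, width)
            features = self.encoder(clip)
            # Per-pixel distinction map in [0, 1], same T/H/W as the input.
            distinction = torch.sigmoid(self.decoder(features))
            # Segment-level highlight score: mean distinction over all pixels.
            score = distinction.mean(dim=(1, 2, 3, 4))
            return distinction, score

    if __name__ == "__main__":
        model = PixelDistinctionNet()
        clip = torch.randn(2, 3, 16, 112, 112)  # two 16-frame RGB clips
        distinction, score = model(clip)
        print(distinction.shape, score.shape)  # (2, 1, 16, 112, 112) and (2,)

Pooling the map into a single score shows how a per-pixel signal can yield the segment-level highlight ranking the abstract describes; the actual paper additionally models the spatial distinction with visual saliency.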
Pages: 3063-3072
Number of pages: 10