Learning Pixel-Level Distinctions for Video Highlight Detection

Cited by: 10
Authors
Wei, Fanyue [1,2,3]
Wang, Biao [3]
Ge, Tiezheng [3]
Jiang, Yuning [3]
Li, Wen [1,2]
Duan, Lixin [1,2]
Affiliations
[1] UESTC, Sch Comp Sci & Engn, Chengdu, Sichuan, Peoples R China
[2] UESTC, Shenzhen Inst Adv Study, Chengdu, Sichuan, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
RANKING;
DOI
10.1109/CVPR52688.2022.00308
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
The goal of video highlight detection is to select the most attractive segments from a long video so as to depict its most interesting parts. Existing methods typically focus on modeling the relationships between different video segments in order to learn a model that can assign highlight scores to these segments; however, these approaches do not explicitly consider the contextual dependency within individual segments. To this end, we propose to learn pixel-level distinctions to improve video highlight detection. Such a pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section. The advantages of modeling this fine-level distinction are two-fold. First, it allows us to exploit the temporal and spatial relations of the content in one video, since the distinction of a pixel in one frame depends heavily both on the content before that frame and on the content around the pixel within the frame. Second, learning the pixel-level distinction also gives a good explanation of the video highlight task regarding what content in a highlight segment is attractive to people. We design an encoder-decoder network to estimate the pixel-level distinction, in which we leverage 3D convolutional neural networks to exploit temporal context information and further take advantage of visual saliency to model the spatial distinction. State-of-the-art performance on three public benchmarks clearly validates the effectiveness of our framework for video highlight detection.
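The abstract describes the method only at a high level. Below is a minimal PyTorch sketch of the general idea it names: a 3D-convolutional encoder-decoder that predicts a per-pixel "distinction" map for a video clip and pools it into a segment-level highlight score. This is an illustrative assumption, not the authors' implementation; the class name PixelDistinctionNet, all layer sizes, and the mean-pooling readout are invented for the example, and the paper's visual-saliency component is omitted.

    # Illustrative sketch only; layer sizes and names are assumptions,
    # not the paper's released architecture.
    import torch
    import torch.nn as nn

    class PixelDistinctionNet(nn.Module):
        def __init__(self, in_channels: int = 3):
            super().__init__()
            # Encoder: 3D convolutions aggregate temporal context across frames
            # while downsampling spatially (stride 2 in H/W, stride 1 in time).
            self.encoder = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
            )
            # Decoder: transposed 3D convolutions restore the input resolution
            # so that every pixel of every frame receives a distinction value.
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(64, 32, kernel_size=(1, 4, 4),
                                   stride=(1, 2, 2), padding=(0, 1, 1)),
                nn.ReLU(inplace=True),
                nn.ConvTranspose3d(32, 1, kernel_size=(1, 4, 4),
                                   stride=(1, 2, 2), padding=(0, 1, 1)),
            )

        def forward(self, clip: torch.Tensor):
            # clip: (batch, channels, time, height, width)
            features = self.encoder(clip)
            # Per-pixel distinction map in [0, 1], same T/H/W as the input.
            distinction = torch.sigmoid(self.decoder(features))
            # Segment-level highlight score: mean distinction over all pixels.
            score = distinction.mean(dim=(1, 2, 3, 4))
            return distinction, score

    if __name__ == "__main__":
        model = PixelDistinctionNet()
        clip = torch.randn(2, 3, 16, 112, 112)  # two 16-frame RGB clips
        distinction, score = model(clip)
        print(distinction.shape, score.shape)  # (2, 1, 16, 112, 112) and (2,)

Pooling the map into a single score shows how a per-pixel signal can yield the segment-level highlight ranking the abstract describes; the actual paper additionally models the spatial distinction with visual saliency.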
Pages: 3063-3072
Number of pages: 10