MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection

Cited by: 37
Authors
Hong, Fa-Ting [1 ,4 ,5 ]
Huang, Xuanteng [1 ]
Li, Wei-Hong [3 ]
Zheng, Wei-Shi [1 ,2 ,5 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Data & Comp Sci, Guangzhou, Peoples R China
[2] Peng Cheng Lab, Shenzhen 518005, Peoples R China
[3] Univ Edinburgh, VICO Grp, Edinburgh, Scotland
[4] Pazhou Lab, Guangzhou, Peoples R China
[5] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Guangzhou, Peoples R China
Source
COMPUTER VISION - ECCV 2020, PT XIII | 2020 / Vol. 12358
DOI
10.1007/978-3-030-58601-0_21
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We address weakly supervised video highlight detection: learning to detect the most attractive segments in training videos from their video event labels alone, without the expensive supervision of manually annotated highlight segments. Although it avoids manual localization of highlights, weakly supervised modeling is challenging, as an everyday video may contain highlight segments of multiple event types, e.g., skiing and surfing. In this work, we propose to cast weakly supervised highlight detection for a given event as learning a multiple instance ranking network (MINI-Net). We treat each video as a bag of segments, and MINI-Net learns to assign a higher highlight score to a positive bag, which contains highlight segments of the specified event, than to irrelevant negative bags. In particular, we formulate a max-max ranking loss that establishes a reliable relative comparison between the most likely positive segment instance and the hardest negative segment instance. With this loss, MINI-Net effectively leverages all segment information to learn a more discriminative video feature representation for localizing the highlight segments of a specific event in a video. Extensive experimental results on three challenging public benchmarks clearly validate the efficacy of our multiple instance ranking approach.
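The max-max ranking loss described above can be sketched as a simple hinge loss between the highest-scoring segment of the positive bag and the highest-scoring segment of the negative bag. This is a minimal illustrative sketch, assuming a hinge formulation with a margin hyperparameter; the function name, signature, and margin value are assumptions for illustration and are not taken from the paper's code.

```python
def max_max_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    """Illustrative max-max ranking hinge loss.

    pos_bag_scores: highlight scores of segments in a positive bag
                    (a video containing highlights of the target event).
    neg_bag_scores: highlight scores of segments in a negative bag
                    (a video irrelevant to the target event).
    """
    # Most likely positive instance: highest-scoring segment in the positive bag.
    pos_max = max(pos_bag_scores)
    # Hardest negative instance: highest-scoring segment in the negative bag.
    neg_max = max(neg_bag_scores)
    # Hinge loss: penalize when the positive bag's best segment does not
    # outscore the hardest negative by at least the margin.
    return max(0.0, margin - pos_max + neg_max)
```

For example, with a positive bag scoring `[0.9, 0.2]` and a negative bag scoring `[0.3, 0.1]` and margin 1.0, the loss is `1.0 - 0.9 + 0.3 = 0.4`; once the positive bag's top segment outscores the hardest negative by the full margin, the loss is zero.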
Pages: 345-360
Page count: 16