DEEP-AD: A Multimodal Temporal Video Segmentation Framework for Online Video Advertising

Cited by: 5
Authors
Tapu, Ruxandra [1 ,2 ]
Mocanu, Bogdan [1 ,2 ]
Zaharia, Titus [1 ]
Affiliations
[1] Telecom SudParis, Inst Polytech Paris, Lab SAMOVAR, ARTEMIS Dept, F-91000 Evry, France
[2] Univ Politehn Bucuresti, Fac ETTI, Telecommun Dept, Bucharest 060042, Romania
Source
IEEE ACCESS | 2020 / Vol. 8
Keywords
Streaming media; Advertising; Visualization; Semantics; Object recognition; TV; Convolutional neural networks; Multimodal temporal video segmentation; thumbnail extraction from video scenes; commercial advertisement insertion based on semantic criteria; deep convolutional neural networks; OBJECT
DOI
10.1109/ACCESS.2020.2997949
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we introduce the DEEP-AD framework, a multimodal advertisement insertion system dedicated to online video platforms. The framework is designed from the viewer's perspective, in terms of the contextual relevance of the commercials and their degree of intrusiveness. The main contribution of the paper is a novel multimodal algorithm for temporal video segmentation into scenes/stories, which makes it possible to automatically determine the temporal instants most appropriate for inserting advertisement clips. The proposed algorithm exploits deep convolutional neural networks at several stages. The video stream is first divided into shots using a graph-partition method. The shots are then clustered into scene/story units with an agglomerative clustering methodology that takes visual, audio, and semantic features as input. Furthermore, in order to facilitate the user's access to multimedia documents, a novel thumbnail extraction method is proposed, based on both semantic representativeness and visual quality. Finally, the optimal advertisement insertion points are determined from the ads' temporal distribution, commercial diversity, and degree of intrusiveness. The experimental evaluation, carried out on a dataset of more than 30 videos taken from French national television and US TV series, validates the proposed methodology, with average accuracy and recognition rates above 88%. Moreover, compared with state-of-the-art methods, the proposed temporal video segmentation yields gains of more than 6% in precision and recall.
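To make the scene-clustering step of the abstract concrete, the sketch below shows one plausible reading of it: shots described by fused multimodal feature vectors are merged into temporally contiguous scene units by constrained agglomerative clustering, and scene boundaries become candidate ad-insertion points. This is a minimal illustration built on scikit-learn, not the authors' implementation; the random features, the 512-D dimensionality, the cosine/average-linkage choice, and the distance threshold are all assumptions standing in for the paper's CNN-based visual, audio, and semantic descriptors.

```python
import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering  # assumes scikit-learn >= 1.2 (metric= parameter)

n_shots, dim = 24, 512

# Stand-in shot descriptors: in the paper each shot would carry fused
# visual/audio/semantic CNN embeddings; here we use random L2-normalized
# vectors purely so the example runs end to end.
rng = np.random.default_rng(0)
shot_features = rng.normal(size=(n_shots, dim))
shot_features /= np.linalg.norm(shot_features, axis=1, keepdims=True)

# Chain connectivity: only temporally adjacent shots may merge, so every
# cluster is a contiguous run of shots, i.e. a scene/story unit.
connectivity = diags([1, 1], offsets=[-1, 1], shape=(n_shots, n_shots))

# Distance-threshold stopping lets the data decide the number of scenes;
# the threshold value here is arbitrary and would be tuned on real features.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.9,
    metric="cosine",
    linkage="average",
    connectivity=connectivity,
)
scene_labels = clustering.fit_predict(shot_features)

# Candidate ad-insertion instants: boundaries between consecutive shots
# assigned to different scenes (the least intrusive cut points).
insertion_points = [
    i + 1 for i in range(n_shots - 1) if scene_labels[i] != scene_labels[i + 1]
]
print(f"{clustering.n_clusters_} scenes; candidate insertion shots: {insertion_points}")
```

Note that the full DEEP-AD pipeline then filters these candidates using criteria the sketch does not model: the ads' temporal distribution, commercial diversity, and degree of intrusiveness.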
Pages: 99582-99597
Number of pages: 16