Semantic Video Retrieval using Deep Learning Techniques

Cited by: 0

Authors:

Yasin, Danish

Sohail, Ashbal

Siddiqi, Imran

Affiliations:

Source:

PROCEEDINGS OF 2020 17TH INTERNATIONAL BHURBAN CONFERENCE ON APPLIED SCIENCES AND TECHNOLOGY (IBCAST) | 2020

Keywords:

Semantic Retrieval; Deep Convolutional Neural Networks (CNNs); Long-Short Term Memory Networks (LSTMs); IMAGE;

DOI:

10.1109/ibcast47879.2020.9044601

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Content-based video retrieval has been an active research area for many decades. Unlike tag-based search engines, which rely on user-assigned annotations to retrieve the desired content, content-based retrieval systems match the actual content of a video against the provided query to fetch the required set of videos. Thanks to recent advances in deep learning, the traditional pipeline of content-based systems (pre-processing, segmentation, object classification, action recognition, etc.) is being replaced by end-to-end trainable systems, which are not only effective and robust but also avoid the complex processing of conventional image-based techniques. The present study builds on these developments to develop a semantic video retrieval system that accepts natural language queries and retrieves the relevant videos. Queries in the current study focus on key individuals appearing in certain scenarios. Persons appearing in a video are recognized by tuning FaceNet to our set of images, while caption generation is used to make sense of the scenario within a given video frame. The outputs of the two modules are combined to generate a description of the frame. During the retrieval phase, natural language queries are provided to the system, and word embeddings are employed to find words similar to those appearing in the query text. For a given query, the system returns all videos in which the queried individuals and scenarios appear. A preliminary experimental study on a collection of 50 videos showed promising retrieval results.
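
The retrieval phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors, the similarity threshold, the index contents, and the names `EMBEDDINGS`, `INDEX`, `expand`, and `retrieve` are all hypothetical, chosen only to show how embedding-based query expansion might be matched against per-frame descriptions (recognized person plus caption words).

```python
import math

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
EMBEDDINGS = {
    "speaking": [0.90, 0.10, 0.00],
    "talking":  [0.85, 0.15, 0.05],
    "walking":  [0.10, 0.90, 0.00],
    "running":  [0.15, 0.85, 0.10],
    "podium":   [0.00, 0.10, 0.90],
}

# Hypothetical index: video id -> one set of words per described frame
# (recognized person names combined with caption words).
INDEX = {
    "video_a": [{"alice", "talking", "podium"}],
    "video_b": [{"bob", "walking"}],
    "video_c": [{"alice", "running"}],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand(word, threshold=0.95):
    """Return the word plus all vocabulary words at least `threshold` similar to it."""
    if word not in EMBEDDINGS:
        return {word}
    query_vec = EMBEDDINGS[word]
    similar = {w for w, vec in EMBEDDINGS.items() if cosine(query_vec, vec) >= threshold}
    return similar | {word}


def retrieve(person, scenario_word):
    """Return ids of videos with a frame showing `person` in a matching scenario."""
    terms = expand(scenario_word)
    return sorted(
        vid
        for vid, frames in INDEX.items()
        if any(person in frame and frame & terms for frame in frames)
    )


print(retrieve("alice", "speaking"))
```

In this toy setup the query word "speaking" never appears in any frame description, yet the frame captioned with "talking" still matches because the two vectors are close. The threshold of 0.95 is an arbitrary assumption and would need tuning against real embeddings.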


Pages: 338-343

Page count: 6

