Video-based recipe retrieval

Cited by: 6
Authors
Cao, Da [1]
Han, Ning [1]
Chen, Hao [1]
Wei, Xiaochi [2]
He, Xiangnan [3]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Hunan, Peoples R China
[2] Baidu Inc, Page Searching Dept, Baidu Technol Pk Bldg 1,10 Xibeiwang East Rd, Beijing 100193, Peoples R China
[3] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230026, Anhui, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Recipe retrieval; Video retrieval; Hierarchical attention network; Deep reinforcement learning; Cross-modal retrieval; Saliency detection; Event detection
DOI
10.1016/j.ins.2019.11.033
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recipe retrieval, which focuses on retrieving a textual recipe given a text or an image as the query, has received considerable attention in the research community. However, cooking is an interesting activity, and many useful elements present in dynamic videos may be omitted from static texts and images. Moreover, although a number of video-based retrieval methods have been investigated, existing techniques mainly target general applications and seldom take domain-specific features into account. To bridge this gap, we investigate a new problem, video-based recipe retrieval: retrieving a cooking video from a list of video candidates given a textual recipe as the query, or vice versa. In this work, we first propose a hierarchical attention network to learn the representations of a textual recipe and its cooking procedures. Moreover, we employ reinforcement learning to dynamically locate a video moment given a cooking procedure as the query. The representations of video moments and cooking procedures are then projected into a common space and optimized with a pairwise ranking loss, which distinguishes matched from unmatched video moment-cooking procedure pairs. Retrieval between cooking videos and textual recipes is then performed by assembling the matching results of video moments and cooking procedures. Experiments on a self-collected dataset demonstrate the effectiveness and rationality of the proposed solution in both overall performance comparison and micro-level analyses. (C) 2019 Elsevier Inc. All rights reserved.
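The pairwise ranking objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: cosine similarity as the matching score, a margin of 0.2, a summed hinge over all in-batch negatives, and averaging as the way per-step matches are assembled into a video-recipe score. The function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(moments, procedures, margin=0.2):
    """Hinge-based ranking loss over embeddings in a common space.

    Assumption: moments[i] is the matched video moment for procedures[i];
    every other pairing in the batch is treated as unmatched.
    """
    n = len(moments)
    total = 0.0
    for i in range(n):
        pos = cosine(moments[i], procedures[i])
        for j in range(n):
            if j == i:
                continue
            # Procedure i should score higher with its own moment than with moment j.
            total += max(0.0, margin - pos + cosine(moments[j], procedures[i]))
            # Moment i should score higher with its own procedure than with procedure j.
            total += max(0.0, margin - pos + cosine(moments[i], procedures[j]))
    return total


def video_recipe_score(moments, procedures):
    """Assumed assembly rule: average the per-step matching scores."""
    return sum(cosine(m, p) for m, p in zip(moments, procedures)) / len(moments)


aligned_loss = pairwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = pairwise_ranking_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

With perfectly aligned embeddings the loss is zero, while swapping the procedures makes every hinge term active, which is the behaviour a ranking loss of this kind is meant to penalize.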
Pages: 302-318
Number of pages: 17