Video-Based Cross-Modal Recipe Retrieval

Cited by: 24
Authors
Cao, Da [1]
Yu, Zhiwang [1]
Zhang, Hanling [1]
Fang, Jiansheng [2]
Nie, Liqiang [3]
Tian, Qi [4]
Affiliations
[1] Hunan University, Changsha, People's Republic of China
[2] CVTE Research, Guangzhou, People's Republic of China
[3] Shandong University, Jinan, People's Republic of China
[4] Huawei, Noah's Ark Lab, Hong Kong, People's Republic of China
Source
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Recipe Retrieval; Video Retrieval; Parallel-Attention Network; Co-Attention Network; Cross-Modal Retrieval
DOI
10.1145/3343031.3351067
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
As a natural extension of image-based cross-modal recipe retrieval, retrieving a specific video given a recipe as the query is seldom explored. Cooking videos contain a variety of temporal and spatial elements. In addition, current image-based cross-modal recipe retrieval approaches mostly model textual and visual content independently and therefore overlook the interaction between the two. In this work, we propose the new problem of video-based cross-modal recipe retrieval and investigate it under the attention paradigm. In particular, we first exploit a parallel-attention network to independently learn the representations of videos and recipes. Next, a co-attention network is proposed to explicitly emphasize the cross-modal interactive features between videos and recipes. Meanwhile, a cross-modal fusion sub-network is proposed to learn both the independent and collaborative dynamics, which can enhance the associated representation of videos and recipes. Finally, the embedding vectors of videos and recipes produced by the joint network are optimized with a pairwise ranking loss. Extensive experiments on a self-collected dataset verify the effectiveness and rationality of the proposed solution.
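
The abstract names a pairwise ranking loss but does not give its form. As a rough illustration only, the sketch below shows one common variant: a bidirectional hinge loss over matched video/recipe embedding pairs. The cosine similarity, the margin value of 0.2, the summation over all in-batch negatives, and the function names are assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_ranking_loss(video_embs, recipe_embs, margin=0.2):
    """Hypothetical bidirectional hinge ranking loss.

    video_embs[i] and recipe_embs[i] are assumed to be a matched pair;
    every other item in the batch serves as a negative.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        pos = cosine(video_embs[i], recipe_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Video i should score higher with its own recipe than with recipe j.
            total += max(0.0, margin - pos + cosine(video_embs[i], recipe_embs[j]))
            # Recipe i should score higher with its own video than with video j.
            total += max(0.0, margin - pos + cosine(video_embs[j], recipe_embs[i]))
    return total


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(pairwise_ranking_loss(videos, aligned))
    print(pairwise_ranking_loss(videos, swapped))
```

In this toy setup the loss is zero when each video is closest to its own recipe, and it grows when the pairs are mismatched, which is the behaviour a ranking objective of this kind is meant to reward.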
Pages: 1685-1693
Page count: 9