Multimodal Encoders for Food-Oriented Cross-Modal Retrieval

Cited by: 5
Authors
Chen, Ying [1 ]
Zhou, Dong [1 ]
Li, Lin [2 ]
Han, Jun-mei [3 ]
Affiliations
[1] Hunan Univ Sci & Technol, Sch Comp Sci & Engn, Xiangtan 411201, Hunan, Peoples R China
[2] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Hubei, Peoples R China
[3] Inst Syst Engn, Dept Syst Gen Design, Natl Key Lab Complex Syst Simulat, Beijing 100101, Peoples R China
Source
WEB AND BIG DATA, APWEB-WAIM 2021, PT II | 2021, Vol. 12859
Funding
National Natural Science Foundation of China;
Keywords
Food-oriented computing; Cross-modal retrieval; Multimodal encoders; Modality alignment;
DOI
10.1007/978-3-030-85899-5_19
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The task of retrieving across different modalities plays a critical role in food-oriented applications. Modality alignment, in which a common embedding space is learned so that items from the two modalities can be compared and retrieved effectively, remains the most challenging step in the process. Recent studies mainly rely on adversarial loss or reconstruction loss to align modalities. However, these methods may extract insufficient features from each modality, resulting in low-quality alignments. In contrast, this paper proposes a method that combines multimodal encoders with adversarial learning to learn improved and efficient cross-modal embeddings for retrieval. The core of the proposed approach is a directional pairwise cross-modal attention that latently adapts representations from one modality to another. Although the model is not particularly complex, experimental results on the benchmark Recipe1M dataset show that the proposed method outperforms current state-of-the-art methods.
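The directional pairwise cross-modal attention described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, one modality querying the other so its representations are adapted toward the second modality; the function names, dimensions, and the omission of learned projections and multi-head structure are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(x_a, x_b):
    """One direction of pairwise cross-modal attention (illustrative).

    x_a: (len_a, d) features of the target modality (e.g. recipe tokens)
    x_b: (len_b, d) features of the source modality (e.g. image regions)
    Returns (len_a, d): x_a's positions re-expressed as attention-weighted
    combinations of x_b, i.e. modality A adapted toward modality B.
    """
    d = x_a.shape[-1]
    scores = x_a @ x_b.T / np.sqrt(d)   # (len_a, len_b) similarity scores
    attn = softmax(scores, axis=-1)     # each row of A attends over all of B
    return attn @ x_b                   # weighted sum of B's features

# "Pairwise" means both directions are computed: text attends to image
# regions, and image regions attend to text.
text_feats = np.random.default_rng(0).normal(size=(4, 8))
img_feats = np.random.default_rng(1).normal(size=(6, 8))
text_to_img = cross_modal_attention(text_feats, img_feats)  # (4, 8)
img_to_text = cross_modal_attention(img_feats, text_feats)  # (6, 8)
```

In practice such a module would use learned query/key/value projections and be stacked inside the multimodal encoders, but the directional A-queries-B / B-queries-A structure is the part the abstract highlights.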
Pages: 253-266 (14 pages)