Cross-modal recipe retrieval based on unified text encoder with fine-grained contrastive learning

Cited by: 2
Authors
Zhang, Bolin [1 ]
Kyutoku, Haruya [2 ]
Doman, Keisuke [3 ]
Komamizu, Takahiro [4 ]
Ide, Ichiro [5 ]
Qian, Jiangbo [1 ]
Affiliations
[1] Ningbo Univ, Fac Elect Engn & Comp Sci, Ningbo, Zhejiang, Peoples R China
[2] Aichi Univ Technol, Fac Engn, Gamagori, Aichi, Japan
[3] Chukyo Univ, Sch Engn, Toyota, Aichi, Japan
[4] Nagoya Univ, Math & Data Sci Ctr, Nagoya, Aichi, Japan
[5] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi, Japan
Keywords
Cross-modal recipe retrieval; Unified text encoder; Contrastive learning;
DOI
10.1016/j.knosys.2024.112641
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal recipe retrieval is vital for transforming visual food cues into actionable cooking guidance, making culinary creativity more accessible. Existing methods separately encode the recipe Title, Ingredient, and Instruction texts with different text encoders, aggregate them into a recipe feature, and finally match it with the encoded image feature in a joint embedding space. These methods perform well but incur significant computational cost. In addition, they only match the entire recipe against the image and ignore the fine-grained correspondence between recipe components and the image, resulting in insufficient cross-modal interaction. To this end, we propose the Unified Text Encoder with Fine-grained Contrastive Learning (UTE-FCL) to achieve a simple but efficient model. Specifically, for each recipe, UTE-FCL first concatenates the multi-sentence Ingredient and Instruction texts each into a single text. Then, it joins these two concatenated texts with the original single-phrase Title to obtain the concatenated recipe. Finally, it encodes these three concatenated texts and the original Title with a Transformer-based Unified Text Encoder (UTE). This proposed structure greatly reduces memory usage and improves feature-encoding efficiency. Furthermore, we propose fine-grained contrastive learning objectives that capture the correspondence between recipe components and the image at the Title, Ingredient, and Instruction levels by measuring mutual information. Extensive experiments demonstrate the effectiveness of UTE-FCL compared to existing methods.
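The fine-grained objective described in the abstract pairs each recipe component (Title, Ingredient, Instruction) with the image through a mutual-information-based contrastive loss. A minimal sketch of one common way to realize this, using an InfoNCE-style estimator on precomputed, batch-aligned embeddings — the function names, component weights, and temperature value here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def info_nce(text_feats, img_feats, tau=0.07):
    """Symmetric InfoNCE loss between batch-aligned text and image features.

    text_feats, img_feats: (B, D) arrays where row i of each is a matching pair.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    logits = t @ v.T / tau                 # (B, B): matching pairs on the diagonal
    idx = np.arange(len(logits))
    # log-softmax over each row, cross-entropy in both retrieval directions
    log_sm = lambda x: x - np.log(np.exp(x - x.max(axis=1, keepdims=True))
                                  .sum(axis=1, keepdims=True)) - x.max(axis=1, keepdims=True)
    loss_t2v = -log_sm(logits)[idx, idx].mean()
    loss_v2t = -log_sm(logits.T)[idx, idx].mean()
    return (loss_t2v + loss_v2t) / 2

def fine_grained_loss(title, ingredient, instruction, image,
                      weights=(1.0, 1.0, 1.0)):
    """Sum one component-level contrastive term per recipe component."""
    components = (title, ingredient, instruction)
    return sum(w * info_nce(c, image) for w, c in zip(weights, components))
```

Each component-level term pulls its component's embedding toward the paired image and pushes it away from the other images in the batch, which is what gives the model the component-to-image correspondence that a single whole-recipe matching loss cannot provide.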
Pages: 15