LLM-Enhanced Composed Image Retrieval: An Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model

Cited by: 1
Authors
Ge, Hongfei [1]
Jiang, Yuanchun [1,2]
Sun, Jianshan [1]
Yuan, Kun [1,3]
Liu, Yezheng [1,4]
Affiliations
[1] Hefei Univ Technol, Sch Management, Hefei, Peoples R China
[2] Minist Educ, Key Lab Proc Optimizat & Intelligent Decis Making, Hefei, Peoples R China
[3] Key Lab Philosophy & Social Sci Cyberspace Behav &, Hefei, Peoples R China
[4] Natl Engn Lab Big Data Distribut & Exchange Techno, Hefei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image retrieval; multi-modal retrieval; intent uncertainty; large language model;
DOI
10.1145/3699715
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Composed image retrieval (CoIR) involves a multi-modal query consisting of a reference image and modification text describing the desired changes, allowing users to express image retrieval intents flexibly and effectively. The key to CoIR lies in properly reasoning about the search intent from the multi-modal query. Existing work either aligns the composite embedding of the multi-modal query with the target image embedding in the visual domain through late fusion, or converts all images into text descriptions and leverages large language models (LLMs) for textual semantic reasoning. However, such single-modality reasoning fails to comprehensively and interpretably capture the ambiguous and uncertain intents in multi-modal queries, causing inconsistency between retrieved results and the ground truth. Moreover, the expense of manually annotated datasets limits further performance improvement in CoIR. To this end, this article proposes an LLM-enhanced Intent Uncertainty-Aware Linguistic-Visual Dual Channel Matching Model (IUDC), which combines the strengths of multi-modal late fusion and LLMs for CoIR. We first construct an LLM-based triplet augmentation strategy to generate additional synthetic training triplets. Building on this, the core of IUDC consists of two matching channels: the semantic matching channel performs intent reasoning on aspect-level attributes extracted by an LLM, and the visual matching channel handles fine-grained visual matching between the multi-modal fusion embedding and target images. Considering the intent uncertainty present in multi-modal queries, we introduce a Probability Distribution Encoder (PDE) to project the intents as probabilistic distributions in the two matching channels. A mutual enhancement module is then designed to share knowledge between the visual and semantic representations for better representation learning. Finally, the matching scores of the two channels are summed to retrieve the target image. Extensive experiments conducted on two real datasets demonstrate the effectiveness and superiority of our model. Notably, with the help of the proposed LLM-based triplet augmentation strategy, our model achieves new state-of-the-art performance on all datasets.
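Since this record gives no implementation details, the following is a minimal PyTorch sketch of the distribution-based intent encoding the abstract describes: an encoder that maps a deterministic query embedding to a Gaussian distribution and samples from it. The class name IntentPDE, the Gaussian parameterization with a diagonal covariance, and all dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class IntentPDE(nn.Module):
    """Hypothetical sketch of a Probability Distribution Encoder (PDE).

    Maps a deterministic intent embedding to a Gaussian (mean + diagonal
    log-variance) and draws samples via the reparameterization trick, so
    an uncertain intent is represented as a distribution rather than a
    single point. Layer names and sizes are assumed for illustration.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)       # distribution mean
        self.logvar_head = nn.Linear(dim, dim)   # log of diagonal variance

    def forward(self, x: torch.Tensor, n_samples: int = 1):
        mu = self.mu_head(x)                      # (B, D)
        logvar = self.logvar_head(x)              # (B, D)
        std = torch.exp(0.5 * logvar)
        # Reparameterization: z = mu + std * eps, with eps ~ N(0, I),
        # keeping the sampling step differentiable for training.
        eps = torch.randn(n_samples, *mu.shape, device=x.device)
        z = mu.unsqueeze(0) + std.unsqueeze(0) * eps  # (n_samples, B, D)
        return z, mu, logvar


if __name__ == "__main__":
    pde = IntentPDE(dim=512)
    fused_query = torch.randn(8, 512)   # e.g., late-fusion query embeddings
    z, mu, logvar = pde(fused_query, n_samples=4)
    print(z.shape)  # torch.Size([4, 8, 512])
```

One such encoder would sit in each matching channel (semantic and visual); per the abstract, the final retrieval score is then simply the sum of the two channels' matching scores against each candidate target image.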
Pages: 30