Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

Cited by: 0
Authors
Lee, Saehyung [1 ]
Yu, Sangwon [1 ]
Park, Junsung [1 ]
Yi, Jihun [1 ]
Yoon, Sungroh [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul, South Korea
Source
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024
Funding
National Research Foundation, Singapore;
Keywords
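DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract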
In this paper, we primarily address the issue of dialogue-form context queries within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our code is available at https://github.com/Saehyung-Lee/PlugIR.
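The two-part loop the abstract describes (reformulate the dialogue into a query, retrieve, then ask a non-redundant question informed by the current candidates) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name here (`interactive_retrieval`, `toy_reformulate`, `toy_retrieve`, `toy_ask`, `GALLERY`, `ANSWERS`) is hypothetical, and the toy functions are simple stand-ins for the LLM calls and the black-box retrieval model. The BRI metric is not shown because the abstract does not define it.

```python
from typing import Callable, Dict, List, Tuple

Dialogue = List[Tuple[str, str]]  # (question, answer) pairs collected so far


def interactive_retrieval(
    description: str,
    reformulate: Callable[[str, Dialogue], str],
    retrieve: Callable[[str, int], List[str]],
    ask: Callable[[str, Dialogue, List[str]], str],
    answer: Callable[[str], str],
    rounds: int = 2,
    top_k: int = 1,
) -> List[str]:
    """Run up to `rounds` question/answer turns and return the final candidates."""
    dialogue: Dialogue = []
    candidates: List[str] = []
    for turn in range(rounds + 1):
        # Part 1: turn the dialogue context into a plain query that any
        # off-the-shelf (black-box) retrieval model can consume.
        query = reformulate(description, dialogue)
        candidates = retrieve(query, top_k)
        if turn == rounds:
            break
        # Part 2: generate a question conditioned on the current candidates,
        # and stop early if it only repeats an earlier question.
        question = ask(description, dialogue, candidates)
        if any(question == asked for asked, _ in dialogue):
            break
        dialogue.append((question, answer(question)))
    return candidates


# Toy stand-ins (assumptions for illustration only).
GALLERY: Dict[str, str] = {
    "img_a": "a dog on a beach",
    "img_b": "a dog on a red sofa",
    "img_c": "a cat on a red sofa",
}
ANSWERS = {"where is it?": "on a sofa", "what color is the furniture?": "red"}


def toy_reformulate(description: str, dialogue: Dialogue) -> str:
    # An LLM would rewrite the dialogue as a caption; here answers are appended.
    return " ".join([description] + [a for _, a in dialogue])


def toy_retrieve(query: str, top_k: int) -> List[str]:
    # Rank images by word overlap between the query and the stored caption.
    words = set(query.lower().split())
    ranked = sorted(GALLERY, key=lambda k: (-len(words & set(GALLERY[k].split())), k))
    return ranked[:top_k]


def toy_ask(description: str, dialogue: Dialogue, candidates: List[str]) -> str:
    # An LLM would inspect the candidates; here the first unasked question is used.
    asked = {q for q, _ in dialogue}
    for question in ANSWERS:
        if question not in asked:
            return question
    return next(iter(ANSWERS))


result = interactive_retrieval(
    "a dog", toy_reformulate, toy_retrieve, toy_ask, ANSWERS.__getitem__
)
print(result)
```

Passing the reformulator, retriever, and questioner as separate callables mirrors the abstract's point that the two components can be applied together or separately; the authors' actual code is at the repository URL given in the abstract.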
Pages: 791-809
Page count: 19