Zero-Shot Composed Image Retrieval with Textual Inversion

Cited by: 23

Authors:

Baldrati, Alberto ^{[1,2]}

Agnolucci, Lorenzo ^{[1]}

Bertini, Marco ^{[1]}

Del Bimbo, Alberto ^{[1]}

Affiliations:

[1] Univ Florence, Media Integrat & Commun Ctr MICC, Florence, Italy

[2] Univ Pisa, Pisa, Italy

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023

Funding:

European Union Horizon 2020

DOI:

10.1109/ICCV51070.2023.01407

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion (SEARLE), maps the visual features of the reference image into a pseudoword token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE.
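Illustrative note (not part of the original record): the query-composition idea in the abstract can be sketched with a toy example. Everything below is a hypothetical stand-in: the 3-dimensional features, the identity matrix `W` used as the inversion map `phi`, the additive `encode_text` function standing in for the CLIP text encoder, and the prompt template `a photo of $ that {caption}` are assumptions chosen for readability, not the released SEARLE implementation.

```python
import math

DIM = 3  # toy feature axes (assumed): [dog, car, red]

# Hypothetical token embedding table; unknown words map to the zero vector.
TOKEN_EMB = {"red": [0.0, 0.0, 1.0]}

# Hypothetical inversion weights: identity, so the pseudo-word equals the image feature.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def phi(image_feat):
    """Map an image feature to a pseudo-word token embedding (toy linear map)."""
    return [sum(w * x for w, x in zip(row, image_feat)) for row in W]


def encode_text(tokens, pseudo_word):
    """Toy text encoder: sum token embeddings, using pseudo_word for the '$' token."""
    total = [0.0] * DIM
    for tok in tokens:
        emb = pseudo_word if tok == "$" else TOKEN_EMB.get(tok, [0.0] * DIM)
        total = [t + e for t, e in zip(total, emb)]
    return normalize(total)


def retrieve(ref_feat, caption, gallery):
    """Rank gallery images by cosine similarity to the composed text query."""
    tokens = f"a photo of $ that {caption}".split()  # assumed template
    query = encode_text(tokens, phi(ref_feat))
    scores = {name: sum(q * g for q, g in zip(query, normalize(feat)))
              for name, feat in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


gallery = {
    "plain_dog": [1.0, 0.0, 0.0],
    "red_dog": [1.0, 0.0, 1.0],
    "red_car": [0.0, 1.0, 1.0],
}
ranking = retrieve([1.0, 0.0, 0.0], "is red", gallery)
print(ranking)
```

In this toy setup the reference image is a plain dog and the relative caption is "is red", so the composed query lands closest to the red dog. No labeled (reference, caption, target) triplet is used at any point, which is the zero-shot property the abstract describes.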

Pages: 15292-15301

Page count: 10
