Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

Cited by: 27
Authors
Saito, Kuniaki [1, 2]
Sohn, Kihyuk [3]
Zhang, Xiang [2]
Li, Chun-Liang [2]
Lee, Chen-Yu [2]
Saenko, Kate [1, 4]
Pfister, Tomas [2]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Google Cloud AI Res, Mountain View, CA 94043 USA
[3] Google Res, Mountain View, CA USA
[4] MIT IBM Watson AI Lab, Cambridge, MA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01850
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmarks CIRR and Fashion-IQ. Code will be made publicly available at https://github.com/google-research/composed_image_retrieval
Pages: 19305-19314
Page count: 10