Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0
Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from a single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps: a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
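The recovery-probe setup described in the abstract can be illustrated with a short sketch. The snippet below is not the authors' released code: the CLIP checkpoint name, the GRU-based probe architecture, and the example caption are assumptions chosen for illustration. It only shows the general idea of freezing a contrastive text encoder, mapping each caption to its single projected vector, and training a small text-only decoder to reconstruct the caption from that vector.

# Illustrative sketch only (not the authors' released code): freeze a CLIP
# text encoder, map each caption to its single projected vector, and train a
# small text-only probe to reconstruct the caption from that vector. The
# checkpoint name, probe architecture, and caption are assumed for the example.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32").to(device).eval()

@torch.no_grad()
def encode_caption(caption: str) -> torch.Tensor:
    # One caption -> one projected vector (the bottleneck under study).
    inputs = tokenizer(caption, return_tensors="pt", truncation=True).to(device)
    return clip_text(**inputs).text_embeds            # shape (1, 512)

class RecoveryProbe(nn.Module):
    # Toy probe: a GRU decoder whose initial state comes from the CLIP vector;
    # it predicts the caption tokens with teacher forcing.
    def __init__(self, vocab_size: int, embed_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(embed_dim, hidden)    # vector -> initial state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, clip_vec, token_ids):
        h0 = torch.tanh(self.init_h(clip_vec)).unsqueeze(0)   # (1, B, hidden)
        states, _ = self.gru(self.embed(token_ids), h0)       # (B, T, hidden)
        return self.out(states)                               # (B, T, vocab)

# One training step on a single (vector, caption) pair; in practice this runs
# over a corpus of compositional captions such as CompPrompts.
probe = RecoveryProbe(vocab_size=tokenizer.vocab_size).to(device)
caption = "two black dogs chasing one white cat"
vec = encode_caption(caption)
ids = tokenizer(caption, return_tensors="pt").input_ids.to(device)
logits = probe(vec, ids[:, :-1])                     # predict next tokens
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
loss.backward()

In this sketch, recovery could be scored by whether greedy decoding from the probe reproduces the caption; the paper's actual probes, metrics, and training setup may differ and are given in the authors' released datasets and code.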
Pages: 4933-4944
Page count: 12
Related Papers (50 records in total)
  • [21] Contrastive Vision-Language Pre-training with Limited Resources
    Cui, Quan
    Zhou, Boyan
    Guo, Yu
    Yin, Weidong
    Wu, Hao
    Yoshie, Osamu
    Chen, Yubo
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 236 - 253
  • [22] Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
    Liang, Xiwen
    Zhu, Fengda
    Zhu, Yi
    Lin, Bingqian
    Wang, Bing
    Liang, Xiaodan
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 1592 - 1600
  • [23] Vision-Language Pre-Training with Triple Contrastive Learning
    Yang, Jinyu
    Duan, Jiali
    Tran, Son
    Xu, Yi
    Chanda, Sampath
    Chen, Liqun
    Zeng, Belinda
    Chilimbi, Trishul
    Huang, Junzhou
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15650 - 15659
  • [24] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
    FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (01)
  • [25] TEXT-IMAGE DE-CONTEXTUALIZATION DETECTION USING VISION-LANGUAGE MODELS
    Huang, Mingzhen
    Jia, Shan
    Chang, Ming-Ching
    Lyu, Siwei
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8967 - 8971
  • [26] LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
    Shi, Cheng
    Yang, Sibei
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 2920 - 2929
  • [27] Conceptual Codebook Learning for Vision-Language Models
    Zhang, Yi
    Yu, Ke
    Wu, Siqi
    He, Zhihai
    COMPUTER VISION - ECCV 2024, PT LXXVII, 2024, 15135 : 235 - 251
  • [28] Unsupervised Prototype Adapter for Vision-Language Models
    Zhang, Yi
    Zhang, Ce
    Hu, Xueting
    He, Zhihai
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 197 - 209
  • [29] Conditional Prompt Learning for Vision-Language Models
    Zhou, Kaiyang
    Yang, Jingkang
    Loy, Chen Change
    Liu, Ziwei
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16795 - 16804
  • [30] Consistent prompt learning for vision-language models
    Zhang, Yonggang
    Tian, Xinmei
    KNOWLEDGE-BASED SYSTEMS, 2025, 310