Text encoders bottleneck compositionality in contrastive vision-language models

Cited by: 0
|
Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90024 USA
[2] Allen Inst AI, Seattle, WA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., single object, to object+property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes compared to prior work. We find that: 1) CLIP's text encoder falls short on more compositional inputs, including object relationships, attribute-object association, counting, and negations; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multimodal matching performance on ControlledImCaps: a new evaluation benchmark we collect and release consisting of fine-grained compositional images and captions. Specifically, our results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive VL models. We release our datasets and code.
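The probing setup described in the abstract can be illustrated with a short sketch: a frozen VL text encoder collapses a caption into one vector, and a small decoder is trained to regenerate the caption tokens from that vector alone. The example below is a minimal, hypothetical version of such a text-only recovery probe; the checkpoint name (openai/clip-vit-base-patch32), the GRU decoder, the example captions, and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a text-only recovery probe, assuming a frozen CLIP text
# encoder from Hugging Face transformers and a small GRU decoder. This is an
# illustration of the idea, not the paper's exact probe architecture.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
encoder.to(device).eval()  # frozen: we only probe its single-vector outputs


class RecoveryProbe(nn.Module):
    """GRU decoder that tries to regenerate caption tokens from one vector."""

    def __init__(self, bottleneck_dim=512, hidden_dim=512, vocab_size=49408):
        super().__init__()
        self.init_h = nn.Linear(bottleneck_dim, hidden_dim)  # vector -> initial state
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, caption_vec, input_ids):
        h0 = torch.tanh(self.init_h(caption_vec)).unsqueeze(0)  # (1, B, H)
        dec_in = self.embed(input_ids[:, :-1])                  # teacher forcing
        hidden, _ = self.gru(dec_in, h0)
        return self.out(hidden)                                 # (B, T-1, vocab)


probe = RecoveryProbe().to(device)
optim = torch.optim.Adam(probe.parameters(), lr=1e-4)
# eos shares the pad id in CLIP's tokenizer, so it is ignored too; fine for a sketch
loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# Illustrative compositional captions (stand-ins for CompPrompts items)
captions = ["a red cube left of a blue sphere", "two dogs chasing one cat"]
batch = tokenizer(captions, padding=True, return_tensors="pt").to(device)

with torch.no_grad():                            # the single-vector bottleneck
    caption_vec = encoder(**batch).text_embeds   # (B, 512)

logits = probe(caption_vec, batch.input_ids)     # predict each next token
target = batch.input_ids[:, 1:]
loss = loss_fn(logits.reshape(-1, logits.size(-1)), target.reshape(-1))
loss.backward()
optim.step()
print(f"one probe training step, loss = {loss.item():.3f}")
```

Recovery quality on held-out captions (e.g., exact-match or token-level accuracy over CompPrompts) would then serve as the measure of how much compositional information survives the single-vector bottleneck.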
Pages: 4933 - 4944
Page count: 12