Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Cited by: 83

Authors:

Thrush, Tristan [1]

Jiang, Ryan [3]

Bartolo, Max [4]

Singh, Amanpreet [1]

Williams, Adina [2]

Kiela, Douwe [1]

Ross, Candace [2]

Affiliations:

[1] Hugging Face, Brooklyn, NY 11201 USA

[2] Facebook AI Res, New York, NY USA

[3] Univ Waterloo, Waterloo, ON, Canada

[4] UCL, London, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022

Keywords:

DOI:

10.1109/CVPR52688.2022.00517

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly, but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
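
The abstract states the matching task but not a scoring rule. As a minimal sketch, assuming a model exposes some similarity score between a caption and an image, one way to check whether an example is matched correctly is shown below. The function names and the 2x2 `sim` layout (`sim[c][i]` is the similarity of caption `c` to image `i`, with caption 0 belonging to image 0 and caption 1 to image 1) are illustrative assumptions, not an API from the dataset or the paper's code. The three checks follow the text, image, and group criteria as commonly described for this benchmark.

```python
from typing import Dict, List, Sequence

# Assumed layout: sim[c][i] = model similarity between caption c and image i.
Sim = Sequence[Sequence[float]]


def text_correct(sim: Sim) -> bool:
    """For each image, the matching caption must outscore the other caption."""
    return sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]


def image_correct(sim: Sim) -> bool:
    """For each caption, the matching image must outscore the other image."""
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]


def group_correct(sim: Sim) -> bool:
    """Both the caption-selection and image-selection checks must pass."""
    return text_correct(sim) and image_correct(sim)


def aggregate(examples: List[Sim]) -> Dict[str, float]:
    """Fraction of examples passing each check (hypothetical helper)."""
    n = len(examples)
    return {
        "text": sum(text_correct(s) for s in examples) / n,
        "image": sum(image_correct(s) for s in examples) / n,
        "group": sum(group_correct(s) for s in examples) / n,
    }


if __name__ == "__main__":
    # Made-up similarity values, for illustration only.
    matched = [[0.9, 0.2], [0.3, 0.8]]
    confused = [[0.5, 0.1], [0.6, 0.9]]
    print(aggregate([matched, confused]))
```

Strict inequalities are used so that a model assigning identical scores to both pairings, which is a real risk when the two captions share exactly the same words, gets no credit for a tie.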

Pages: 5228-5238

Page count: 11

50 items in total

[21] Large language models and linguistic intentionality
Grindrod, Jumbly
SYNTHESE, 2024, 204 (02)
[22] Discourse Probing of Pretrained Language Models
Koto, Fajri
Lau, Jey Han
Baldwin, Timothy
2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 3849 - 3864
[23] Probing for Referential Information in Language Models
Sorodoc, Ionut-Teodor
Gulordava, Kristina
Boleda, Gemma
58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 4177 - 4189
[24] LINGUISTIC DEFINITION OF GENERIC MODELS IN COMPUTER VISION
FRETWELL, P
GOILLAU, PJ
LECTURE NOTES IN COMPUTER SCIENCE, 1988, 301 : 306 - 314
[25] Towards a linguistic vision of the world at the paremiological level of language
Kotova, Marina Yu
Raina, Olga V.
VESTNIK SANKT-PETERBURGSKOGO UNIVERSITETA-YAZYK I LITERATURA, 2020, 17 (03) : 487 - 504
[26] What Do Language Models Hear? Probing for Auditory Representations in Language Models
Ngo, Jerry
Kim, Yoon
PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 5435 - 5448
[27] Vision-Language Models for Vision Tasks: A Survey
Zhang, Jingyi
Huang, Jiaxing
Jin, Sheng
Lu, Shijian
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
[28] George Berkeley's language of vision and the occult tradition of linguistic Platonism. Part II: George Berkeley's language of vision and linguistic Platonism
Isermann, Michael M.
LANGUAGE & COMMUNICATION, 2008, 28 (01) : 57 - 92
[29] Incorporating linguistic structure into statistical language models
Rosenfeld, R
PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2000, 358 (1769) : 1311 - 1324
[30] Language models and linguistic theories beyond words
Nature Machine Intelligence, 2023, 5 : 677 - 678