Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Cited by: 83
Authors
Thrush, Tristan [1]
Jiang, Ryan [3]
Bartolo, Max [4]
Singh, Amanpreet [1]
Williams, Adina [2]
Kiela, Douwe [1]
Ross, Candace [2]
Affiliations
[1] Hugging Face, Brooklyn, NY 11201 USA
[2] Facebook AI Res, New York, NY USA
[3] Univ Waterloo, Waterloo, ON, Canada
[4] UCL, London, England
DOI
10.1109/CVPR52688.2022.00517
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly; crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
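The matching task described in the abstract can be illustrated with a minimal sketch. It assumes a model exposes some similarity score between a caption and an image; the 2x2 score layout, the function names, and the example captions below are hypothetical and are not taken from this record. The sketch only shows what "matching correctly" could mean for one example and how the same-words constraint on the captions can be checked.

```python
from typing import Sequence


def same_words(caption_a: str, caption_b: str) -> bool:
    """True if both captions use exactly the same words, ignoring order and case."""
    return sorted(caption_a.lower().split()) == sorted(caption_b.lower().split())


def captions_matched(sim: Sequence[Sequence[float]]) -> bool:
    """sim[c][i] is the (assumed) similarity of caption c and image i.

    For each image, the correct caption must score higher than the other one.
    """
    return sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]


def images_matched(sim: Sequence[Sequence[float]]) -> bool:
    """For each caption, the correct image must score higher than the other one."""
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]


def fully_matched(sim: Sequence[Sequence[float]]) -> bool:
    """Both directions of the matching must be right for the same example."""
    return captions_matched(sim) and images_matched(sim)


# Hypothetical caption pair: same words, different order.
pair_ok = same_words("a dog chases a cat", "a cat chases a dog")

# Hypothetical similarity scores for two models on one example.
good_scores = [[0.9, 0.2], [0.3, 0.8]]
bad_scores = [[0.9, 0.95], [0.3, 0.8]]
```

With these made-up scores, `fully_matched(good_scores)` holds while `fully_matched(bad_scores)` does not, because the second model scores image 1 higher than image 0 for caption 0.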
Pages: 5228-5238
Number of pages: 11