Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Cited by: 83
Authors
Thrush, Tristan [1]
Jiang, Ryan [3]
Bartolo, Max [4]
Singh, Amanpreet [1]
Williams, Adina [2]
Kiela, Douwe [1]
Ross, Candace [2]
Affiliations
[1] Hugging Face, Brooklyn, NY 11201 USA
[2] Facebook AI Res, New York, NY USA
[3] Univ Waterloo, Waterloo, ON, Canada
[4] UCL, London, England
DOI
10.1109/CVPR52688.2022.00517
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly, but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
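The matching task described in the abstract can be made concrete with a small sketch. It assumes a model exposes a single similarity score for each caption-image pair; the function names, the 2x2 score layout, and the example captions and numbers below are hypothetical illustrations, not taken from the paper or the dataset.

```python
from typing import Dict, Sequence


def same_words(caption_a: str, caption_b: str) -> bool:
    """True if both captions use exactly the same words, ignoring order."""
    return sorted(caption_a.lower().split()) == sorted(caption_b.lower().split())


def check_matching(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Check one example, where sim[i][j] scores caption i against image j.

    Caption 0 belongs with image 0 and caption 1 with image 1 (assumed layout).
    """
    # For each image, the correct caption must outscore the other caption.
    captions_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # For each caption, the correct image must outscore the other image.
    images_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    return {
        "captions_ok": captions_ok,
        "images_ok": images_ok,
        "both_ok": captions_ok and images_ok,
    }


# Hypothetical caption pair: identical words, different order.
captions = ["a dog chases a cat", "a cat chases a dog"]
assert same_words(*captions)

# Hypothetical scores from a model that gets everything right.
good = check_matching([[0.31, 0.29], [0.30, 0.33]])

# Hypothetical scores where caption 0 prefers the wrong image.
weak = check_matching([[0.31, 0.32], [0.30, 0.33]])
```

Requiring all four comparisons to hold at once is stricter than checking either direction alone, which is one reason a model can look adequate on a single direction and still fail the example as a whole.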
Pages: 5228-5238
Page count: 11
Related papers
50 in total
  • [1] Critical Analysis of Deconfounded Pretraining to Improve Visio-Linguistic Models
    Cornille, Nathan
    Laenen, Katrien
    Moens, Marie-Francine
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2022, 5
  • [2] DeVLBert: Learning Deconfounded Visio-Linguistic Representations
    Zhang, Shengyu
    Jiang, Tan
    Wang, Tan
    Kuang, Kun
    Zhao, Zhou
    Zhu, Jianke
    Yu, Jin
    Yang, Hongxia
    Wu, Fei
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 4373-4382
  • [3] DeVLBert: Out-of-distribution Visio-Linguistic Pretraining with Causality
    Zhang, Shengyu
    Jiang, Tan
    Wang, Tan
    Kuang, Kun
    Zhao, Zhou
    Zhu, Jianke
    Yu, Jin
    Yang, Hongxia
    Wu, Fei
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021: 1744-1747
  • [4] What You Say Is Not What You Do: Studying Visio-Linguistic Models for TV Series Summarization
    Reboud, Alison
    Troncy, Raphael
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021: 3142-3146
  • [5] Text encoders bottleneck compositionality in contrastive vision-language models
    Kamath, Amita
    Hessel, Jack
    Chang, Kai-Wei
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023: 4933-4944
  • [6] ECHO: A Visio-Linguistic Dataset for Event Causality Inference via Human-Centric ReasOning
    Xie, Yuxi
    Li, Guanzhen
    Kan, Min-Yen
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023: 4064-4085
  • [7] Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding
    Zhang, Le
    Awal, Rabiul
    Agrawal, Aishwarya
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 13774-13784
  • [8] ML2MG-VLCR: A Multimodal LLM Guided Zero-shot Method for Visio-linguistic Compositional Reasoning with Autoregressive Generative Language Model
    Gong, Ziyu
    Mai, Chengcheng
    Huang, Yihua
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024: 842-850
  • [9] Probing vision and language models for construction waste material recognition
    Sun, Ying
    Gu, Zhaolin
    Yang, Sean Bin
    AUTOMATION IN CONSTRUCTION, 2024, 166
  • [10] Measuring and Narrowing the Compositionality Gap in Language Models
    Press, Ofir
    Zhang, Muru
    Min, Sewon
    Schmidt, Ludwig
    Smith, Noah A.
    Lewis, Mike
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023: 5687-5711