What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Cited by: 0
Authors
Kamath, Amita [1 ]
Hessel, Jack [2 ]
Chang, Kai-Wei [1 ]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[2] Allen Inst AI, Seattle, WA USA
Source
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023) | 2023
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"? We curate three new corpora to quantify model comprehension of such basic spatial relations. These tests isolate spatial reasoning more precisely than existing datasets like VQAv2, e.g., our What'sUp benchmark contains sets of photographs varying only the spatial relations of objects, keeping their identity fixed (see Figure 1: models must comprehend not only the usual case of a dog under a table, but also, the same dog on top of the same table). We evaluate 18 VL models, finding that all perform poorly, e.g., BLIP finetuned on VQAv2, which nears human parity on VQAv2, achieves 56% accuracy on our benchmarks vs. humans at 99%. We conclude by studying causes of this surprising behavior, finding: 1) that popular vision-language pretraining corpora like LAION-2B contain little reliable data for learning spatial relationships; and 2) that basic modeling interventions like up-weighting preposition-containing instances or fine-tuning on our corpora are not sufficient to address the challenges our benchmarks pose. We are hopeful that these corpora will facilitate further research, and we release our data and code at https://github.com/amitakamath/whatsup_vlms.
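To make the evaluation setup described above concrete, the following is a minimal sketch (not taken from the authors' released whatsup_vlms code) of how a contrastive VL model such as CLIP can be scored on one What'sUp-style item: the model sees a single photograph and must rank captions that differ only in the spatial preposition. The checkpoint name, image path, and caption wording are illustrative assumptions.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # illustrative checkpoint; any CLIP variant could be slotted in
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# Hypothetical image and its four spatial-relation caption variants
image = Image.open("dog_and_table.jpg")  # placeholder path, not a file from the released data
captions = [
    "a dog under a table",
    "a dog on a table",
    "a dog to the left of a table",
    "a dog to the right of a table",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 4): image-text similarity scores

# The model counts as correct only if the caption with the true preposition ranks highest
print("predicted caption:", captions[logits.argmax(dim=-1).item()])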
Pages: 9161-9175
Page count: 15