VISREAS: Complex Visual Reasoning with Unanswerable Questions

Cited by: 0
Authors
Akter, Syeda Nahida [1 ]
Lee, Sangwu [2 ]
Chang, Yingshan [1 ]
Bisk, Yonatan [1 ]
Nyberg, Eric [1 ]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Univ Rochester, Dept Comp Sci, Rochester, NY USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify the discrepancies in the query and convey them to the user rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, which consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating a question's answerability with respect to an image before answering it, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION, which reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain over classification models.
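
The abstract describes the generation procedure only at a high level. As a purely illustrative aid, the Python sketch below assumes a toy Visual Genome-style scene graph (attributed objects plus subject-relation-object triples) and shows the core idea stated above: a query becomes unanswerable once a fact it presupposes has been perturbed away from what the image contains. The `scene_graph` dictionary and the `is_answerable` helper are hypothetical and are not part of the VISREAS or LOGIC2VISION release.

```python
# Toy Visual Genome-style scene graph for a single image (hypothetical format):
# objects carry attribute sets, and relations are (subject, relation, object) triples.
scene_graph = {
    "objects": {
        "cup": {"red", "ceramic"},
        "table": {"wooden"},
    },
    "relations": {("cup", "on", "table")},
}


def is_answerable(graph, obj, attribute, relation=None):
    """Return True only if every fact the query presupposes is present in the graph."""
    if obj not in graph["objects"]:
        return False  # the referenced object is not in the image
    if attribute not in graph["objects"][obj]:
        return False  # the presupposed attribute does not hold
    if relation is not None and relation not in graph["relations"]:
        return False  # the presupposed relation does not hold
    return True


# "What is the red cup on the table made of?" -> grounded in the graph, answerable
print(is_answerable(scene_graph, "cup", "red", ("cup", "on", "table")))      # True

# "What is the plastic cup on the table made of?" -> perturbed attribute, unanswerable
print(is_answerable(scene_graph, "cup", "plastic", ("cup", "on", "table")))  # False
```
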
Pages: 6735-6752
Number of pages: 18
Related Papers
38 in total
  • [11] Learning to Reason: End-to-End Module Networks for Visual Question Answering
    Hu, Ronghang
    Andreas, Jacob
    Rohrbach, Marcus
    Darrell, Trevor
    Saenko, Kate
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 804 - 813
  • [12] Hudson Drew, 2019, ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, V32
  • [13] Hudson Drew A, 2019, IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION
  • [14] Hudson Drew A, 2018, INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS
  • [15] Jain S, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P3543
  • [16] Jia Robin, 2017, PROCEEDINGS OF THE 2017 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P2021, DOI 10.18653/V1/D17-1215
  • [17] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
    Johnson, Justin
    Hariharan, Bharath
    van der Maaten, Laurens
    Fei-Fei, Li
    Zitnick, C. Lawrence
    Girshick, Ross
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1988 - 1997
  • [18] Visual question answering: Datasets, algorithms, and future challenges
    Kafle, Kushal
    Kanan, Christopher
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2017, 163 : 3 - 20
  • [19] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
    Krishna, Ranjay
    Zhu, Yuke
    Groth, Oliver
    Johnson, Justin
    Hata, Kenji
    Kravitz, Joshua
    Chen, Stephanie
    Kalantidis, Yannis
    Li, Li-Jia
    Shamma, David A.
    Bernstein, Michael S.
    Fei-Fei, Li
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2017, 123 (01) : 32 - 73
  • [20] Li Junnan, 2023, INTERNATIONAL CONFERENCE ON MACHINE LEARNING