VISREAS: Complex Visual Reasoning with Unanswerable Questions

Cited by: 0
Authors
Akter, Syeda Nahida [1 ]
Lee, Sangwu [2 ]
Chang, Yingshan [1 ]
Bisk, Yonatan [1 ]
Nyberg, Eric [1 ]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Univ Rochester, Dept Comp Sci, Rochester, NY USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify discrepancies in the query and convey them to the user rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, which consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically from Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION, which reasons by producing and executing pseudocode, without any external modules, to generate the answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant performance gain over classification models.(1)
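The abstract's central idea is that a model should check a query's presuppositions against the image before answering. A minimal sketch of that validate-before-answering pattern, using a toy scene-graph representation (the object/attribute format and the `answer` helper here are illustrative assumptions, not the VISREAS or LOGIC2VISION implementation):

```python
# Hypothetical sketch: validate a visual query's presuppositions against
# a toy scene graph and flag unanswerable questions instead of guessing.
# The data format and function below are illustrative, not from the paper.

def answer(scene_objects, query_object, query_attribute):
    """Return the attribute value, or explain why the query is unanswerable."""
    matches = [o for o in scene_objects if o["name"] == query_object]
    if not matches:  # presupposition failure: object absent from the image
        return f"Unanswerable: no '{query_object}' in the image"
    obj = matches[0]
    if query_attribute not in obj:  # presupposition failure: attribute missing
        return f"Unanswerable: '{query_object}' has no '{query_attribute}'"
    return obj[query_attribute]

scene = [{"name": "cup", "color": "red"},
         {"name": "table", "material": "wood"}]
print(answer(scene, "cup", "color"))   # valid query
print(answer(scene, "dog", "color"))   # unanswerable: object not present
```

A generative VQA model with no such check would instead hallucinate a plausible color for the absent dog; the dataset is designed to penalize exactly that behavior.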
Pages: 6735-6752
Page count: 18