VISREAS: Complex Visual Reasoning with Unanswerable Questions

Cited by: 0
Authors
Akter, Syeda Nahida [1 ]
Lee, Sangwu [2 ]
Chang, Yingshan [1 ]
Bisk, Yonatan [1 ]
Nyberg, Eric [1 ]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Univ Rochester, Dept Comp Sci, Rochester, NY USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify discrepancies in the query and convey them to the user rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, which consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically from Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION, which reasons by producing and executing pseudocode, without any external modules, to generate the answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant performance gain over classification models.(1)
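The abstract's central idea is that a model should check a query's presuppositions against the image before answering. A minimal sketch of that validate-before-answering pattern, using a toy scene-graph representation (the object/attribute format and the `answer` helper here are illustrative assumptions, not the VISREAS or LOGIC2VISION implementation):

```python
# Hypothetical sketch: validate a visual query's presuppositions against
# a toy scene graph and flag unanswerable questions instead of guessing.
# The data format and function below are illustrative, not from the paper.

def answer(scene_objects, query_object, query_attribute):
    """Return the attribute value, or explain why the query is unanswerable."""
    matches = [o for o in scene_objects if o["name"] == query_object]
    if not matches:  # presupposition failure: object absent from the image
        return f"Unanswerable: no '{query_object}' in the image"
    obj = matches[0]
    if query_attribute not in obj:  # presupposition failure: attribute missing
        return f"Unanswerable: '{query_object}' has no '{query_attribute}'"
    return obj[query_attribute]

scene = [{"name": "cup", "color": "red"},
         {"name": "table", "material": "wood"}]
print(answer(scene, "cup", "color"))   # valid query
print(answer(scene, "dog", "color"))   # unanswerable: object not present
```

A generative VQA model with no such check would instead hallucinate a plausible color for the absent dog; the dataset is designed to penalize exactly that behavior.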
Pages: 6735-6752
Page count: 18