VISREAS: Complex Visual Reasoning with Unanswerable Questions

Cited by: 0
Authors
Akter, Syeda Nahida [1 ]
Lee, Sangwu [2 ]
Chang, Yingshan [1 ]
Bisk, Yonatan [1 ]
Nyberg, Eric [1 ]
Affiliations
[1] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[2] Univ Rochester, Dept Comp Sci, Rochester, NY USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Verifying a question's validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should identify the discrepancies in the query and convey them to the user rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VISREAS, which consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VISREAS contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating a question's answerability with respect to an image before answering it, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, LOGIC2VISION, which reasons by producing and executing pseudocode without any external modules to generate the answer. LOGIC2VISION outperforms generative models on VISREAS (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain over classification models.
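
The abstract describes the generation procedure only at a high level. As a purely illustrative aid, the Python sketch below assumes a toy Visual Genome-style scene graph (attributed objects plus subject-relation-object triples) and shows the core idea stated above: a query becomes unanswerable once a fact it presupposes has been perturbed away from what the image contains. The `scene_graph` dictionary and the `is_answerable` helper are hypothetical and are not part of the VISREAS or LOGIC2VISION release.

```python
# Toy Visual Genome-style scene graph for a single image (hypothetical format):
# objects carry attribute sets, and relations are (subject, relation, object) triples.
scene_graph = {
    "objects": {
        "cup": {"red", "ceramic"},
        "table": {"wooden"},
    },
    "relations": {("cup", "on", "table")},
}


def is_answerable(graph, obj, attribute, relation=None):
    """Return True only if every fact the query presupposes is present in the graph."""
    if obj not in graph["objects"]:
        return False  # the referenced object is not in the image
    if attribute not in graph["objects"][obj]:
        return False  # the presupposed attribute does not hold
    if relation is not None and relation not in graph["relations"]:
        return False  # the presupposed relation does not hold
    return True


# "What is the red cup on the table made of?" -> grounded in the graph, answerable
print(is_answerable(scene_graph, "cup", "red", ("cup", "on", "table")))      # True

# "What is the plastic cup on the table made of?" -> perturbed attribute, unanswerable
print(is_answerable(scene_graph, "cup", "plastic", ("cup", "on", "table")))  # False
```
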
Pages: 6735-6752
Number of pages: 18
Related Papers
38 in total
  • [11] Learning to Reason: End-to-End Module Networks for Visual Question Answering
    Hu, Ronghang
    Andreas, Jacob
    Rohrbach, Marcus
    Darrell, Trevor
    Saenko, Kate
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 804 - 813
  • [12] Hudson Drew, 2019, ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, V32
  • [13] Hudson Drew A, 2019, IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION
  • [14] Hudson Drew A, 2018, INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS
  • [15] Jain S, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P3543
  • [16] Jia Robin, 2017, PROCEEDINGS OF THE 2017 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P2021, DOI 10.18653/V1/D17-1215
  • [17] CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
    Johnson, Justin
    Hariharan, Bharath
    van der Maaten, Laurens
    Fei-Fei, Li
    Zitnick, C. Lawrence
    Girshick, Ross
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1988 - 1997
  • [18] Visual question answering: Datasets, algorithms, and future challenges
    Kafle, Kushal
    Kanan, Christopher
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2017, 163 : 3 - 20
  • [19] Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
    Krishna, Ranjay
    Zhu, Yuke
    Groth, Oliver
    Johnson, Justin
    Hata, Kenji
    Kravitz, Joshua
    Chen, Stephanie
    Kalantidis, Yannis
    Li, Li-Jia
    Shamma, David A.
    Bernstein, Michael S.
    Fei-Fei, Li
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2017, 123 (01) : 32 - 73
  • [20] Li Junnan, 2023, INTERNATIONAL CONFERENCE ON MACHINE LEARNING