Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Cited by: 349
Authors
Agrawal, Aishwarya [1 ,3 ]
Batra, Dhruv [1 ,2 ]
Parikh, Devi [1 ,2 ]
Kembhavi, Aniruddha [3 ]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Facebook AI Res, Menlo Pk, CA USA
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Source
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
DOI
10.1109/CVPR.2018.00522
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage the development of models geared towards the latter, we propose a new setting for VQA where, for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2, respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, enabling the model to generalize more robustly across different distributions of answers. GVQA is built off an existing VQA model, Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both the VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.
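The abstract's central dataset idea, re-splitting VQA so that train and test answer priors differ within each question type, can be illustrated with a short sketch. The following is a minimal toy version in Python, not the authors' released construction procedure: the entry fields (`question_type`, `answer`) and the per-answer-group splitting heuristic are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def changing_priors_split(entries, seed=0):
    """Toy re-split in the spirit of VQA-CP: every answer can appear in
    both train and test, but with a different relative frequency per
    question type, so per-type answer priors differ across the split.

    `entries` is assumed to be a list of dicts with hypothetical
    'question_type' and 'answer' fields; this is NOT the released
    VQA-CP construction procedure.
    """
    rng = random.Random(seed)

    # Group questions by (question type, answer): the unit whose
    # train/test proportion we deliberately vary.
    groups = defaultdict(list)
    for e in entries:
        groups[(e["question_type"], e["answer"])].append(e)

    train, test = [], []
    for (_qtype, _answer), items in groups.items():
        rng.shuffle(items)
        # Draw a different test fraction for each (type, answer) group.
        # Because the fraction varies across answers of the same question
        # type, the answer distribution for that type ends up skewed
        # between the train and test sides.
        frac = rng.uniform(0.1, 0.9)
        cut = int(frac * len(items))
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

# Tiny usage example with made-up entries.
if __name__ == "__main__":
    data = [{"question_type": "what color", "answer": a}
            for a in ["red"] * 60 + ["blue"] * 40]
    tr, te = changing_priors_split(data)
    print(len(tr), len(te))  # red/blue priors now differ per side
```

Under this kind of split, a model that memorizes the majority answer for each question type is penalized at test time, which is exactly the prior-driven "cheating" failure mode the abstract describes.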
Pages: 4971-4980 (10 pages)