Be flexible! learn to debias by sampling and prompting for robust visual question answering

被引:10
作者
Liu, Jin [1 ]
Fan, ChongFeng [1 ]
Zhou, Fengyu [1 ]
Xu, Huijuan [2 ]
机构
[1] Shandong Univ, Jinan, Peoples R China
[2] Penn State Univ, Philadelphia, PA USA
关键词
Question debiasing; Visual question answering; Contrastive sampling; Question-type guided prompt;
D O I
10.1016/j.ipm.2023.103296
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Recent studies point out that VQA models tend to rely on the language prior in the training data to answer the questions, which prevents the VQA model from generalization on the out-of-distribution test data. To address this problem, approaches are designed to reduce the language distribution prior effect by constructing negative image-question pairs, while they cannot provide the proper visual reason for answering the question. In this paper, we present a new debiasing framework for VQA by Learning to Sample paired image-question and Prompt for given question (LSP). Specifically, we construct the negative image-question pairs with certain sampling rate to prevent the model from overly relying on the visual shortcut content. Notably, question types provide a strong hint for answering the questions. We utilize question type to constrain the sampling process for negative question-image pairs, and further learn the question type-guided prompt for better question comprehension. Extensive experiments on two public benchmarks, VQA-CP v2 and VQA v2, demonstrate that our model achieves new state-of-the-art results in overall accuracy, i.e., 61.95% and 65.26%.
引用
收藏
页数:14
相关论文
共 41 条
[1]   Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering [J].
Agrawal, Aishwarya ;
Batra, Dhruv ;
Parikh, Devi ;
Kembhavi, Aniruddha .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :4971-4980
[2]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[3]   VQA: Visual Question Answering [J].
Antol, Stanislaw ;
Agrawal, Aishwarya ;
Lu, Jiasen ;
Mitchell, Margaret ;
Batra, Dhruv ;
Zitnick, C. Lawrence ;
Parikh, Devi .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :2425-2433
[4]  
Cadene R., 2019, ADV NEUR IN, P1
[5]   Counterfactual Samples Synthesizing for Robust Visual Question Answering [J].
Chen, Long ;
Yan, Xin ;
Xiao, Jun ;
Zhang, Hanwang ;
Pu, Shiliang ;
Zhuang, Yueting .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :10797-10806
[6]  
Cho K., 2014, P EMPIRICAL METHODS, P1724, DOI 10.48550/arXiv.1406.1078
[7]  
Clark C, 2019, 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019), P4069
[8]  
Ding N., 2021, arXiv, DOI DOI 10.48550/ARXIV.2108.10604
[9]   Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering [J].
Goyal, Yash ;
Khot, Tejas ;
Summers-Stay, Douglas ;
Batra, Dhruv ;
Parikh, Devi .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :6325-6334
[10]  
Grand Gabriel, 2019, P 2 WORKSH SHORTC VI, P3