Counterfactual VQA: A Cause-Effect Look at Language Bias

Cited by: 241
Authors
Niu, Yulei [1 ]
Tang, Kaihua [1 ]
Zhang, Hanwang [1 ]
Lu, Zhiwu [2 ,3 ]
Hua, Xian-Sheng [4 ]
Wen, Ji-Rong [2 ,3 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
[3] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[4] Alibaba Grp, Damo Acad, Hangzhou, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
National Natural Science Foundation of China;
Keywords
INFERENCE; ATTENTION;
DOI
10.1109/CVPR46437.2021.01251
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
VQA models tend to rely on language bias as a shortcut and thus fail to sufficiently learn multi-modal knowledge from both vision and language. Recent debiasing methods propose to exclude the language prior during inference. However, they fail to disentangle the "good" language context from the "bad" language bias. In this paper, we investigate how to mitigate language bias in VQA. Motivated by causal effects, we propose a novel counterfactual inference framework, which enables us to capture the language bias as the direct causal effect of questions on answers and to reduce it by subtracting the direct language effect from the total causal effect. Experiments demonstrate that our proposed counterfactual inference framework 1) generalizes to various VQA backbones and fusion strategies, and 2) achieves competitive performance on the language-bias-sensitive VQA-CP dataset while performing robustly on the balanced VQA v2 dataset without any augmented data.
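The subtraction the abstract describes can be sketched in standard causal-effect notation. This is a reconstruction from the abstract's wording, not text from the record itself: the symbols $Q$ (question), $V$ (image), $A$ (answer), and the reference values $q^*, v^*$ follow common causal-inference conventions.

```latex
% Total effect (TE) of question q and image v on answer A,
% measured against a no-treatment reference (q^*, v^*):
\mathrm{TE} = Y_{q,\,v}(A) \;-\; Y_{q^*,\,v^*}(A)

% Natural direct effect (NDE) of the question alone,
% i.e. the "bad" language bias, obtained by blocking the
% multi-modal (vision+language) branch at its reference value:
\mathrm{NDE} = Y_{q,\,v^*}(A) \;-\; Y_{q^*,\,v^*}(A)

% Debiased inference subtracts the direct language effect
% from the total effect, leaving the total indirect effect:
\mathrm{TIE} = \mathrm{TE} - \mathrm{NDE}
            = Y_{q,\,v}(A) \;-\; Y_{q,\,v^*}(A)
```

Under this sketch, answers are ranked by $\mathrm{TIE}$ rather than by the raw model output, so the language-only shortcut is removed while the question's context is retained through the multi-modal term.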
Pages: 12695-12705
Page count: 11