VQA-BC: ROBUST VISUAL QUESTION ANSWERING VIA BIDIRECTIONAL CHAINING

Cited by: 3
Authors
Lao, Mingrui [1 ]
Guo, Yanming [2 ]
Chen, Wei [1 ]
Pu, Nan [1 ]
Lew, Michael S. [1 ]
Affiliations
[1] Leiden Univ, LIACS Medialab, Leiden, Netherlands
[2] Natl Univ Def Technol, Coll Syst Engn, Changsha, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Visual question answering; language bias; forward/backward chaining; label smoothing
DOI
10.1109/ICASSP43922.2022.9746493
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Current VQA models suffer from over-dependence on language bias, which severely reduces their robustness in real-world scenarios. In this paper, we analyze VQA models from the perspective of forward/backward chaining in an inference engine, and propose to enhance their robustness via a novel Bidirectional Chaining (VQA-BC) framework. Specifically, we introduce backward chaining with hard-negative contrastive learning to reason from the consequence (answers) and generate crucial known facts (question-related visual region features). Furthermore, to alleviate the over-confidence problem in answer prediction (forward chaining), we present a novel introspective regularization that connects forward and backward chaining via label smoothing. Extensive experiments verify that VQA-BC not only effectively overcomes language bias on the out-of-distribution dataset, but also alleviates the over-correction problem caused by ensemble-based methods on the in-distribution dataset. Compared with competitive debiasing strategies, our method achieves state-of-the-art performance in reducing language bias on the VQA-CP v2 dataset.
Pages: 4833-4837
Number of pages: 5
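
The abstract names two generic building blocks: a hard-negative contrastive loss for the backward-chaining branch and label smoothing to curb over-confident answer prediction. The Python sketch below illustrates only these standard components; it is not the authors' VQA-BC implementation, and every tensor shape, temperature, and smoothing factor is an illustrative assumption rather than a value taken from the paper.

```python
# Minimal sketch (not the authors' released code): a generic hard-negative
# contrastive loss and a label-smoothed classification loss, the two standard
# ingredients the abstract mentions. Shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def hard_negative_contrastive_loss(anchor, positive, hard_negative, temperature=0.1):
    """InfoNCE-style loss with one hard negative per anchor.

    anchor:        (B, D) e.g. answer-conditioned features (backward chaining)
    positive:      (B, D) e.g. question-related visual region features
    hard_negative: (B, D) e.g. features of an irrelevant region for the same question
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negative = F.normalize(hard_negative, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1) / temperature       # (B,)
    neg_sim = (anchor * hard_negative).sum(dim=-1) / temperature  # (B,)
    logits = torch.stack([pos_sim, neg_sim], dim=1)               # (B, 2)
    targets = torch.zeros(anchor.size(0), dtype=torch.long)       # positive sits at index 0
    return F.cross_entropy(logits, targets)


def label_smoothing_ce(logits, targets, smoothing=0.1):
    """Cross-entropy against uniformly smoothed targets, the usual recipe for
    discouraging over-confident answer predictions."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth_targets = torch.full_like(log_probs, smoothing / (num_classes - 1))
    smooth_targets.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return -(smooth_targets * log_probs).sum(dim=-1).mean()


if __name__ == "__main__":
    B, D, C = 4, 256, 3129  # batch size, feature dim, answer-vocabulary size (assumed)
    loss_bc = hard_negative_contrastive_loss(
        torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    )
    loss_fc = label_smoothing_ce(torch.randn(B, C), torch.randint(0, C, (B,)))
    print(loss_bc.item(), loss_fc.item())
```

In the paper's framing, the two losses would be computed on the backward-chaining and forward-chaining branches respectively and combined with the introspective regularization; how they are weighted and connected is specified in the paper itself, not in this sketch.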