A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

Cited: 25
Authors
Guo, Yangyang [1 ]
Nie, Liqiang [2 ]
Wong, Yongkang [1 ]
Liu, Yibing [3 ]
Cheng, Zhiyong [4 ]
Kankanhalli, Mohan [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Harbin Inst Technol, Shenzhen, Peoples R China
[3] City Univ Hong Kong, Hong Kong, Peoples R China
[4] Qilu Univ Technol, Shandong Acad Sci, Jinan, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Research Foundation, Singapore
Keywords
Visual Question Answering; Knowledge Integration; Modal Fusion;
DOI
10.1145/3503161.3547870
CLC Number
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
Knowledge-based Visual Question Answering (VQA) expects models to rely on external knowledge for robust answer prediction. Despite its significance, this paper identifies several key factors impeding the advancement of current state-of-the-art methods. On the one hand, methods that exploit explicit knowledge treat the knowledge as a complement to a coarsely trained VQA model. Despite their effectiveness, these approaches often suffer from noise incorporation and error propagation. On the other hand, multi-modal implicit knowledge for knowledge-based VQA remains largely unexplored. This work presents a unified end-to-end retriever-reader framework for knowledge-based VQA. In particular, we shed light on the multi-modal implicit knowledge embedded in vision-language pre-training models and mine its potential for knowledge reasoning. To address the noise introduced by retrieval over explicit knowledge, we design a novel scheme that creates pseudo labels for effective knowledge supervision. This scheme not only provides guidance for knowledge retrieval but also drops instances that are potentially error-prone for question answering. To validate the effectiveness of the proposed method, we conduct extensive experiments on the benchmark dataset. The experimental results show that our method outperforms existing baselines by a noticeable margin. Beyond the reported numbers, this paper further offers several insights on knowledge utilization for future research, supported by empirical findings.
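As a concrete illustration of the pseudo-label idea sketched in the abstract, the following is a minimal Python sketch assuming a simple answer-containment heuristic: a retrieved knowledge passage is labeled positive when it mentions the gold answer, and an instance whose retrieved passages never mention the answer is dropped as error-prone. The Instance structure and the make_pseudo_labels function are illustrative assumptions, not the paper's actual scheme.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Instance:
    question: str
    answer: str
    passages: List[str]                  # retrieved explicit-knowledge passages
    labels: Optional[List[int]] = None   # pseudo labels: 1 = useful, 0 = not useful

def make_pseudo_labels(instances):
    """Assign pseudo labels by answer containment and drop instances
    whose retrieved passages never mention the gold answer."""
    kept = []
    for inst in instances:
        ans = inst.answer.lower()
        labels = [int(ans in p.lower()) for p in inst.passages]
        if any(labels):   # keep only instances with at least one useful passage
            inst.labels = labels
            kept.append(inst)
    return kept

# Toy usage: the second instance is dropped because no passage mentions "Monet".
data = [
    Instance("What is the capital of France?", "Paris",
             ["Paris is the capital of France.", "The Seine flows through it."]),
    Instance("Who painted this?", "Monet",
             ["Impressionism emerged in the 19th century."]),
]
print(len(make_pseudo_labels(data)))     # -> 1

In such a setup, the surviving pseudo labels could supervise the retriever, while dropping answer-absent instances removes training signal that would otherwise propagate retrieval errors into the reader.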
Pages: 2061-2069
Number of pages: 9