Image captioning for effective use of language models in knowledge-based visual question answering

Cited by: 30
Authors
Salaberria, Ander [1]
Azkune, Gorka [1]
Lacalle, Oier Lopez de [1]
Soroa, Aitor [1]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
Keywords
Visual question answering; Image captioning; Language models; Deep learning
DOI
10.1016/j.eswa.2022.118669
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be offset by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
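
To make the described procedure concrete, below is a minimal Python sketch of the caption-then-answer pipeline, assuming the Hugging Face transformers library; the model checkpoints (nlpconnect/vit-gpt2-image-captioning, google/flan-t5-base) and the answer_from_image helper are illustrative assumptions, not the models or code used in the paper.

from transformers import pipeline

# Step 1: verbalize the image with an off-the-shelf captioning model
# (illustrative checkpoint; the paper's captioning setup may differ).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Step 2: a text-only pretrained language model answers from the caption,
# drawing on the world knowledge acquired during pretraining.
reader = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_from_image(image_path: str, question: str) -> str:
    # Turn pixels into text so a unimodal language model can reason over them.
    caption = captioner(image_path)[0]["generated_text"]
    # Prompt the language model with the verbalized image plus the question.
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return reader(prompt, max_new_tokens=10)[0]["generated_text"].strip()

# Hypothetical OK-VQA-style usage:
# answer_from_image("photo.jpg", "What country does this dish come from?")

Because the image is reduced to a caption before any reasoning happens, the quality of the answer depends on the caption capturing the question-relevant content, which is exactly the failure mode the qualitative analysis above identifies.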
Pages: 10