Image captioning for effective use of language models in knowledge-based visual question answering

Cited by: 30
Authors
Salaberria, Ander [1]
Azkune, Gorka [1]
Lacalle, Oier Lopez de [1]
Soroa, Aitor [1]
Agirre, Eneko [1]
Affiliations
[1] Univ Basque Country UPV EHU, HiTZ Basque Ctr Language Technol, Ixa NLP Grp, M Lardizabal 1, Donostia San Sebastian 20018, Basque Country, Spain
Keywords
Visual question answering; Image captioning; Language models; Deep learning
DOI
10.1016/j.eswa.2022.118669
CLC number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be offset by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
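
To make the described procedure concrete, below is a minimal Python sketch of the caption-then-answer pipeline, assuming the Hugging Face transformers library; the model checkpoints (nlpconnect/vit-gpt2-image-captioning, google/flan-t5-base) and the answer_from_image helper are illustrative assumptions, not the models or code used in the paper.

from transformers import pipeline

# Step 1: verbalize the image with an off-the-shelf captioning model
# (illustrative checkpoint; the paper's captioning setup may differ).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Step 2: a text-only pretrained language model answers from the caption,
# drawing on the world knowledge acquired during pretraining.
reader = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_from_image(image_path: str, question: str) -> str:
    # Turn pixels into text so a unimodal language model can reason over them.
    caption = captioner(image_path)[0]["generated_text"]
    # Prompt the language model with the verbalized image plus the question.
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return reader(prompt, max_new_tokens=10)[0]["generated_text"].strip()

# Hypothetical OK-VQA-style usage:
# answer_from_image("photo.jpg", "What country does this dish come from?")

Because the image is reduced to a caption before any reasoning happens, the quality of the answer depends on the caption capturing the question-relevant content, which is exactly the failure mode the qualitative analysis above identifies.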
Pages: 10