Fine-tuning your answers: a bag of tricks for improving VQA models

Citations: 0
Authors
Arroyo, Roberto [1 ]
Alvarez, Sergio [1 ]
Aller, Aitor [1 ]
Bergasa, Luis M. [2 ]
Ortiz, Miguel E. [2 ]
Affiliations
[1] NielsenIQ, Madrid, Spain
[2] University of Alcalá (UAH), Department of Electronics, Madrid, Spain
Keywords
Computer vision; Natural language processing; Knowledge representation & reasoning; Visual question answering; Artificial intelligence
DOI
10.1007/s11042-021-11546-z
Chinese Library Classification
TP [Automation & Computer Technology]
Discipline code
0812 (Computer Science & Technology)
Abstract
In this paper, one of the most novel topics in Deep Learning (DL) is explored: Visual Question Answering (VQA). This research area combines three of the most important fields in Artificial Intelligence (AI) to automatically provide natural-language answers to questions that a user asks about an image: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). First, the state of the art in VQA is reviewed and our contributions to it are discussed. We then build upon Pythia, one of the most outstanding approaches: its architecture is studied with the aim of presenting several enhancements over the original proposal, fine-tuning models using a bag of tricks. Several training strategies are compared in order to increase global accuracy and to understand the limitations associated with VQA models. Extended results assess the impact of the different tricks on our enhanced architecture, together with additional qualitative results.
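To make the "bag of tricks" idea concrete, the sketch below shows a minimal, hypothetical fine-tuning loop for a Pythia-like VQA model in PyTorch. Everything here (the toy model, feature dimensions, warmup schedule) is an illustrative assumption for exposition, not the authors' actual implementation; it only demonstrates two tricks commonly used in VQA training, soft multi-label answer targets and learning-rate warmup.

# Minimal sketch (assumptions throughout): a toy Pythia-like VQA model
# fine-tuned with two common tricks: soft answer targets and LR warmup.
import torch
import torch.nn as nn

class TinyVQAModel(nn.Module):
    # Fuses image and question features by element-wise product and
    # predicts scores over a fixed answer vocabulary.
    def __init__(self, img_dim=2048, q_dim=300, hidden=512, n_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # visual branch
        self.q_proj = nn.Linear(q_dim, hidden)       # language branch
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feats, q_feats):
        fused = torch.relu(self.img_proj(img_feats)) * torch.relu(self.q_proj(q_feats))
        return self.classifier(fused)

model = TinyVQAModel()
criterion = nn.BCEWithLogitsLoss()  # VQA answers are commonly soft multi-label targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Trick: linear learning-rate warmup over the first steps of fine-tuning.
warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(100):                # stand-in loop with random data
    img = torch.randn(32, 2048)        # pre-extracted image features
    q = torch.randn(32, 300)           # question embeddings
    target = torch.rand(32, 3000)      # soft answer scores in [0, 1]
    loss = criterion(model(img, q), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

A real pipeline would load pre-trained weights and actual VQA data; the point is only how warmup and soft targets slot into an otherwise standard fine-tuning loop.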
Pages: 26889-26913
Page count: 25