Fine-tuning your answers: a bag of tricks for improving VQA models

Citations: 0
Authors
Arroyo, Roberto [1 ]
Alvarez, Sergio [1 ]
Aller, Aitor [1 ]
Bergasa, Luis M. [2 ]
Ortiz, Miguel E. [2 ]
Affiliations
[1] NielsenIQ, Madrid, Spain
[2] University of Alcalá (UAH), Department of Electronics, Madrid, Spain
Keywords
Computer vision; Natural language processing; Knowledge representation & reasoning; Visual question answering; Artificial intelligence
DOI
10.1007/s11042-021-11546-z
Chinese Library Classification
TP [Automation & Computer Technology]
Discipline code
0812 (Computer Science & Technology)
Abstract
In this paper, one of the most novel topics in Deep Learning (DL) is explored: Visual Question Answering (VQA). This research area combines three of the most important fields in Artificial Intelligence (AI) to automatically provide natural-language answers to questions that a user asks about an image: 1) Computer Vision (CV), 2) Natural Language Processing (NLP) and 3) Knowledge Representation & Reasoning (KR&R). First, the state of the art in VQA is reviewed and our contributions to it are discussed. We then build upon Pythia, one of the most outstanding approaches: its architecture is studied with the aim of presenting several enhancements over the original proposal, fine-tuning models using a bag of tricks. Several training strategies are compared in order to increase global accuracy and to understand the limitations associated with VQA models. Extended results assess the impact of the different tricks on our enhanced architecture, together with additional qualitative results.
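To make the "bag of tricks" idea concrete, the sketch below shows a minimal, hypothetical fine-tuning loop for a Pythia-like VQA model in PyTorch. Everything here (the toy model, feature dimensions, warmup schedule) is an illustrative assumption for exposition, not the authors' actual implementation; it only demonstrates two tricks commonly used in VQA training, soft multi-label answer targets and learning-rate warmup.

# Minimal sketch (assumptions throughout): a toy Pythia-like VQA model
# fine-tuned with two common tricks: soft answer targets and LR warmup.
import torch
import torch.nn as nn

class TinyVQAModel(nn.Module):
    # Fuses image and question features by element-wise product and
    # predicts scores over a fixed answer vocabulary.
    def __init__(self, img_dim=2048, q_dim=300, hidden=512, n_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # visual branch
        self.q_proj = nn.Linear(q_dim, hidden)       # language branch
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feats, q_feats):
        fused = torch.relu(self.img_proj(img_feats)) * torch.relu(self.q_proj(q_feats))
        return self.classifier(fused)

model = TinyVQAModel()
criterion = nn.BCEWithLogitsLoss()  # VQA answers are commonly soft multi-label targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Trick: linear learning-rate warmup over the first steps of fine-tuning.
warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

for step in range(100):                # stand-in loop with random data
    img = torch.randn(32, 2048)        # pre-extracted image features
    q = torch.randn(32, 300)           # question embeddings
    target = torch.rand(32, 3000)      # soft answer scores in [0, 1]
    loss = criterion(model(img, q), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

A real pipeline would load pre-trained weights and actual VQA data; the point is only how warmup and soft targets slot into an otherwise standard fine-tuning loop.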
Pages: 26889-26913
Page count: 25