A Comparative Evaluation of Transformer-Based Vision Encoder-Decoder Models for Brazilian Portuguese Image Captioning

Cited by: 0
Authors
Bromonschenkel, Gabriel [1 ]
Oliveira, Hilario [1 ]
Paixao, Thiago M. [1 ]
Affiliations
[1] Inst Fed Espirito Santo IFES, Programa Posgrad Comp Aplicada PPComp, Serra, Brazil
Source
2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI 2024), 2024
DOI: 10.1109/SIBGRAPI62404.2024.10716325
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Image captioning refers to the process of creating a natural language description for one or more images. This task has several practical applications, from aiding in medical diagnoses through image descriptions to promoting social inclusion by providing visual context to people with visual impairments. Despite recent progress, especially in English, low-resource languages like Brazilian Portuguese face a shortage of datasets, models, and studies. This work seeks to contribute to this context by fine-tuning and investigating the performance of vision language models based on the Transformer architecture in Brazilian Portuguese. We leverage pre-trained vision model checkpoints (ViT, Swin, and DeiT) and neural language models (BERTimbau, DistilBERTimbau, and GPorTuguese-2). Several experiments were carried out to compare the performance of different model combinations using #PraCegoVer-63K, a native Portuguese dataset, and a translated version of the Flickr30K dataset. The experimental results demonstrated that configurations using the Swin, DistilBERTimbau, and GPorTuguese-2 models generally achieved the best outcomes. Furthermore, the #PraCegoVer-63K dataset presents a series of challenges, such as descriptions made up of multiple sentences and the presence of proper names of places and people, which significantly decrease the performance of the investigated models.
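The abstract describes pairing pre-trained vision encoders (ViT, Swin, DeiT) with Portuguese language decoders (BERTimbau, DistilBERTimbau, GPorTuguese-2). Below is a minimal sketch, not the authors' code, of how such a vision encoder-decoder captioning model can be assembled and used with the Hugging Face transformers library; the checkpoint identifiers (google/vit-base-patch16-224-in21k for ViT, pierreguillou/gpt2-small-portuguese for GPorTuguese-2) and the generation settings are assumptions chosen for illustration, and the Swin/DeiT or BERTimbau variants would be swapped in analogously.

```python
# Sketch: assemble a ViT encoder + GPorTuguese-2 decoder captioning model.
# Checkpoint names and hyperparameters are illustrative assumptions, not the
# paper's exact configuration.
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

encoder_ckpt = "google/vit-base-patch16-224-in21k"    # assumed ViT checkpoint
decoder_ckpt = "pierreguillou/gpt2-small-portuguese"  # assumed GPorTuguese-2 checkpoint

# Combine the two checkpoints; the decoder's cross-attention layers are
# randomly initialized and must be learned during fine-tuning on captions.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)

image_processor = AutoImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

# Token ids the generation loop relies on.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id


def caption(image, max_length=64, num_beams=4):
    """Generate a Portuguese caption for a PIL image after fine-tuning."""
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=max_length, num_beams=num_beams)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

After fine-tuning on image-caption pairs, `caption(img)` returns a beam-searched Portuguese description; evaluation against reference captions would then use standard captioning metrics such as BLEU or METEOR.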
Pages: 235-240 (6 pages)