ETransCap: efficient transformer for image captioning

Cited by: 0
Authors
Mundu, Albert [1 ]
Singh, Satish Kumar [1 ]
Dubey, Shiv Ram [1 ]
Affiliations
[1] IIIT Allahabad, Department of IT, Computer Vision & Biometrics Lab (CVBL), Allahabad, India
Keywords
Deep learning; Natural language processing; Image captioning; Scene understanding; Transformers; Efficient transformers; Attention
DOI
10.1007/s10489-024-05739-w
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a challenging computer vision task: a model must automatically generate a textual description of an image by integrating visual and linguistic information, and the generated captions must accurately describe the image's content while adhering to the conventions of natural language. We adopt the encoder-decoder framework used by CNN-RNN-based captioning models over the past few years. Recently, CNN-Transformer-based models have surpassed the traditional CNN-RNN-based models in this area, and many researchers have concentrated on Transformers, exploring and uncovering their possibilities. Unlike conventional CNN-RNN models, transformer-based models offer the benefit of handling longer input sequences more efficiently; however, they are resource-intensive to train and deploy, particularly for large-scale tasks or tasks that require real-time processing. In this work, we introduce a lightweight and efficient transformer-based model, the Efficient Transformer Captioner (ETransCap), which consumes fewer computational resources to generate captions. Our model operates in linear complexity and has been trained and tested on the MS-COCO dataset. Comparisons with existing state-of-the-art models show that ETransCap achieves promising results, supporting its potential for image captioning in real-time applications. Code for this project will be available at https://github.com/albertmundu/etranscap.
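The abstract's central efficiency claim, attention that runs in linear rather than quadratic complexity in sequence length, is characteristic of kernel-based (linearized) attention. Below is a minimal NumPy sketch of that general idea; the elu(x)+1 feature map and all shapes are illustrative assumptions, not ETransCap's actual attention layer, whose exact formulation is given in the paper.

    import numpy as np

    def feature_map(x):
        # Positive kernel feature map phi(x) = elu(x) + 1; a common choice
        # in linearized attention (an assumption here, not the paper's kernel).
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V):
        # Standard softmax attention materializes an (n, n) matrix: O(n^2).
        # Reassociating phi(Q) (phi(K)^T V) keeps cost linear in length n.
        Qp, Kp = feature_map(Q), feature_map(K)        # (n, d)
        KV = Kp.T @ V                                  # (d, d_v) key/value summary
        Z = Qp @ Kp.sum(axis=0, keepdims=True).T       # (n, 1) normalizer
        return (Qp @ KV) / (Z + 1e-6)                  # (n, d_v)

    # Toy usage: 1024 tokens, 64-dim heads; memory stays O(n*d), never O(n^2).
    rng = np.random.default_rng(0)
    n, d = 1024, 64
    Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
    print(linear_attention(Q, K, V).shape)             # (1024, 64)

The saving comes purely from associativity: computing phi(K)^T V first yields a small d x d_v summary, so the n x n attention matrix is never formed.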
Pages: 10748-10762
Page count: 15