GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited by: 0
Authors
Anundskas, Lars Halvor [1 ]
Afridi, Hina [1 ,3 ]
Tarekegn, Adane Nega [1 ]
Yamin, Muhammad Mudassar [2 ]
Ullah, Mohib [1 ]
Yamin, Saira [2 ]
Cheikh, Faouzi Alaya [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Chest X-ray; convolutional neural networks; attention; GloVe embeddings; gated recurrent units
DOI
10.1109/ICASSPW59220.2023.10193011
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Describing pictures in natural language is a complex undertaking within the realm of computer vision. Image captioning involves producing textual depictions of images, which can be achieved through learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, conventional RNNs face challenges such as exploding and vanishing gradients, leading to inferior, non-evocative sentences. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art EfficientNet backbone as the encoder. The decoder is built from multimodal gated recurrent units (GRUs), which incorporate GloVe word embeddings for the text data and visual attention over the image features. The network is trained on three different datasets, Indiana Chest X-ray, COCO, and WIT, and the results are evaluated with the standard BLEU and METEOR performance metrics. The quantitative results show that the network achieves promising results compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
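The decoder described in the abstract — a GRU that, at each step, combines the previous word's (GloVe-initialised) embedding with an attention-weighted sum of the encoder's spatial features — can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, the additive attention form, and the feature dimension (1280, matching EfficientNet-B0's final feature depth) are assumptions, and in the paper the embedding matrix would be initialised from pretrained GloVe vectors.

```python
import torch
import torch.nn as nn


class AttentionGRUDecoder(nn.Module):
    """Sketch of a GRU caption decoder with visual attention.

    At each time step the input is the previous word embedding
    (GloVe-initialised in the paper) concatenated with an
    attention-weighted context vector over the CNN's spatial features.
    """

    def __init__(self, vocab_size, embed_dim=300, feat_dim=1280, hidden_dim=512):
        super().__init__()
        # In the paper, these weights would be loaded from GloVe vectors.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention: score each spatial location given the hidden state.
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.gru = nn.GRU(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats:    (B, L, feat_dim) spatial features from the CNN encoder
        # captions: (B, T) token ids, used with teacher forcing
        B, T = captions.shape
        h = torch.zeros(1, B, self.gru.hidden_size)
        emb = self.embed(captions)                              # (B, T, embed_dim)
        logits = []
        for t in range(T):
            # Attention weights over the L spatial locations.
            q = h[-1].unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = self.attn(torch.cat([q, feats], dim=-1))   # (B, L, 1)
            ctx = (scores.softmax(dim=1) * feats).sum(dim=1)    # (B, feat_dim)
            step = torch.cat([emb[:, t], ctx], dim=-1).unsqueeze(1)
            y, h = self.gru(step, h)
            logits.append(self.out(y.squeeze(1)))
        return torch.stack(logits, dim=1)                       # (B, T, vocab_size)


# Example: decode captions of length 12 over 7x7=49 spatial feature vectors.
decoder = AttentionGRUDecoder(vocab_size=100)
feats = torch.randn(2, 49, 1280)
caps = torch.randint(0, 100, (2, 12))
print(decoder(feats, caps).shape)  # torch.Size([2, 12, 100])
```

During training, the per-step logits would be scored against the shifted caption with cross-entropy; at inference the loop would instead feed back the previously predicted token.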
Pages: 5