GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited: 0
Authors
Anundskas, Lars Halvor [1 ]
Afridi, Hina [1 ,3 ]
Tarekegn, Adane Nega [1 ]
Yamin, Muhammad Mudassar [2 ]
Ullah, Mohib [1 ]
Yamin, Saira [2 ]
Cheikh, Faouzi Alaya [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Chest X-ray; Convolutional neural networks; attention; GloVe embeddings; gated recurrent units;
DOI
10.1109/ICASSPW59220.2023.10193011
CLC Number
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
Describing images in natural language is a complex task in computer vision. Image captioning generates textual descriptions of images, typically with learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Conventional RNNs, however, suffer from exploding and vanishing gradients, which leads to poor, non-evocative sentences. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art EfficientNet backbone as the encoder network. The decoder is a multimodal gated recurrent unit (GRU) that combines GloVe word embeddings for the text data with visual attention over the image data. The network is trained on three different datasets, Indiana Chest X-ray, COCO, and WIT, and the results are evaluated on the standard performance metrics of BLEU and METEOR. The quantitative results show that the network achieves promising results compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
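The decoder described in the abstract, a GRU whose input concatenates a GloVe word embedding with an attention-weighted visual context vector, can be sketched in plain NumPy. This is a minimal illustration only: all dimensions, weight names (`Wf`, `Wh`, `v`, and the GRU gate matrices), and the random stand-ins for CNN features and GloVe vectors are assumptions for the sketch, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def visual_attention(features, hidden, Wf, Wh, v):
    """Additive attention over a grid of CNN features.
    features: (L, D) spatial feature vectors; hidden: (H,) decoder state."""
    scores = np.tanh(features @ Wf + hidden @ Wh) @ v   # (L,) alignment scores
    alpha = softmax(scores)                              # attention weights, sum to 1
    context = alpha @ features                           # (D,) attended visual context
    return context, alpha

def gru_step(x, h, Wz, Uz, Wr, Ur, Wn, Un):
    """One GRU update (PyTorch-style convention: h' = (1-z)*n + z*h)."""
    z = 1.0 / (1.0 + np.exp(-(x @ Wz + h @ Uz)))   # update gate
    r = 1.0 / (1.0 + np.exp(-(x @ Wr + h @ Ur)))   # reset gate
    n = np.tanh(x @ Wn + (r * h) @ Un)             # candidate state
    return (1.0 - z) * n + z * h

# Toy dimensions: 7x7 feature grid, feature/hidden/embedding sizes are arbitrary.
L, D, H, E = 49, 8, 16, 10
features = rng.normal(size=(L, D))      # stand-in for EfficientNet feature map
glove = rng.normal(size=E)              # stand-in for a GloVe word vector
h = np.zeros(H)                         # initial decoder state

Wf, Wh, v = rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=H)
context, alpha = visual_attention(features, h, Wf, Wh, v)

# Multimodal input: word embedding concatenated with the visual context.
x = np.concatenate([glove, context])
gates = [0.1 * rng.normal(size=s) for s in
         [(E + D, H), (H, H), (E + D, H), (H, H), (E + D, H), (H, H)]]
h = gru_step(x, h, *gates)
print(alpha.shape, h.shape)
```

In a full pipeline this step would run once per generated token, with the attention re-computed from the updated hidden state so the decoder can look at different image regions for each word.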
Pages: 5