GLOVE-ING ATTENTION: A MULTI-MODAL NEURAL LEARNING APPROACH TO IMAGE CAPTIONING

Cited by: 0
Authors
Anundskas, Lars Halvor [1]
Afridi, Hina [1, 3]
Tarekegn, Adane Nega [1]
Yamin, Muhammad Mudassar [2]
Ullah, Mohib [1]
Yamin, Saira [2]
Cheikh, Faouzi Alaya [1]
Affiliations
[1] Norwegian Univ Sci & Technol NTNU, Dept Comp Sci, Trondheim, Norway
[2] Dept Management Sci, CUI Wah Campus, Wah Cantt, Pakistan
[3] Geno SA, Storhamargata 44, N-2317 Hamar, Norway
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Chest X-ray; Convolutional neural networks; attention; GloVe embeddings; gated recurrent units;
DOI
10.1109/ICASSPW59220.2023.10193011
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
Describing images in natural language is a complex task in computer vision. Image captioning produces textual descriptions of images, typically with deep learning frameworks that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, conventional RNNs suffer from exploding and vanishing gradients, which leads to inferior results and non-evocative sentences. In this paper, we propose an encoder-decoder deep neural network for image captioning that uses the state-of-the-art backbone architecture EfficientNet as the encoder network. The decoder uses multimodal gated recurrent units (GRUs), which incorporate GloVe word embeddings for the text data and visual attention for the image data. The network is trained on three different datasets, Indiana Chest X-ray, COCO, and WIT, and the results are evaluated with the standard performance metrics BLEU and METEOR. The quantitative results show that the network achieves promising results compared to state-of-the-art models. The source code is publicly available at https://bitbucket.org/larswise/imagecaptioning/src/master/wit_pipeline/.
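The visual attention step mentioned in the abstract can be illustrated with a minimal sketch. It assumes dot-product scoring between the decoder state and each image-region feature; the abstract does not specify the scoring function, so this is an illustration of soft attention in general, not the authors' implementation. The function name `soft_attention` and the toy vectors are hypothetical.

```python
import math

def soft_attention(region_features, hidden_state):
    """Return (weights, context) for one decoding step.

    region_features: list of image-region vectors (e.g. CNN feature-map cells).
    hidden_state: the decoder's current hidden vector, same length as each region.
    Dot-product scoring is an illustrative assumption, not the paper's stated choice.
    """
    # Score each region by its similarity to the decoder state.
    scores = [sum(f * h for f, h in zip(region, hidden_state))
              for region in region_features]
    # Softmax, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(hidden_state)
    context = [sum(w * region[i] for w, region in zip(weights, region_features))
               for i in range(dim)]
    return weights, context

# Toy example: three 2-dimensional regions and one decoder state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
```

In a captioning decoder, a context vector of this kind would typically be combined with the current word embedding and fed to the recurrent unit at each step, so that the caption can focus on different image regions as it is generated.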
Pages: 5