Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Cited by: 12
Authors
Zhang, Huawei [1 ]
Ma, Chengbo [1 ]
Jiang, Zhanjun [1 ]
Lian, Jing [1 ]
Affiliations
[1] Lanzhou Jiaotong Univ, Elect & Informat Engn, Lanzhou 730000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Semantics; Visualization; Data mining; Decoding; Task analysis; Logic gates; Bi-LSTM; image caption generation; semantic fusion; semantic similarity;
DOI
10.1109/ACCESS.2022.3232508
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
The image caption generation task requires expressing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in front-to-back order and cannot exploit the full context of the sentence. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure that draws on both past and subsequent information, so the predicted description is conditioned on context cues from both directions. The visual features are fed separately into the F-LSTM (forward LSTM) decoder and the B-LSTM (backward LSTM) decoder to extract semantic information, and the two semantic outputs complement each other. Specifically, a subsidiary attention mechanism, S-Att, acts between the F-LSTM and B-LSTM: it extracts the semantic information of each direction with attention, measures the semantic interaction by similarity while aligning the hidden states, and outputs the fused semantic information. The resulting Bi-LSTM-s model captures contextual information and produces finer-grained image captions. Our model improves on the original LSTM baseline by 9.7%, resolves the inconsistency between the semantic information generated in the forward and backward directions at the same time step, and achieves a BLEU-4 score of 37.5. The superiority of the approach is demonstrated experimentally on the MSCOCO dataset.
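The abstract gives no implementation details of the S-Att fusion step, but its core idea (aligning forward and backward hidden states by similarity, then fusing them) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' S-Att mechanism: the function names, the softmax over cosine similarities, and the simple averaging at the end are all assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (with a small epsilon for stability)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fuse_bidirectional(h_fwd, h_bwd):
    """Fuse forward and backward decoder hidden states, shape (T, d) each.

    For each forward state, attend over all backward states using
    softmax-normalized cosine-similarity scores, then average the
    forward state with the attended backward context (the averaging
    is an illustrative choice, not the paper's fusion rule).
    """
    T, _ = h_fwd.shape
    fused = np.empty_like(h_fwd)
    for t in range(T):
        scores = np.array([cosine_sim(h_fwd[t], h_bwd[s]) for s in range(T)])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()               # attention weights over backward states
        context = weights @ h_bwd              # (d,) attended backward context
        fused[t] = 0.5 * (h_fwd[t] + context)  # fused semantic representation
    return fused
```

In a full model the hidden states would come from the F-LSTM and B-LSTM decoders at each time step; here random arrays stand in for them.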
Pages: 134 - 143
Number of pages: 10
Related Papers
46 records in total
  • [11] Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey
    Han, Junwei
    Zhang, Dingwen
    Cheng, Gong
    Liu, Nian
    Xu, Dong
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2018, 35 (01) : 84 - 100
  • [12] Deep Residual Learning for Image Recognition
    He, Kaiming
    Zhang, Xiangyu
    Ren, Shaoqing
    Sun, Jian
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 770 - 778
  • [13] Hochreiter S, 1997, NEURAL COMPUT, V9, P1735, DOI [10.1162/neco.1997.9.1.1, 10.1007/978-3-642-24797-2]
  • [14] Huang ZH, 2015, Arxiv, DOI 10.48550/ARXIV.1508.01991
  • [15] Guiding the Long-Short Term Memory model for Image Caption Generation
    Jia, Xu
    Gavves, Efstratios
    Fernando, Basura
    Tuytelaars, Tinne
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 2407 - 2415
  • [16] Words Matter: Scene Text for Image Classification and Retrieval
    Karaoglu, Sezer
    Tao, Ran
    Gevers, Theo
    Smeulders, Arnold W. M.
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (05) : 1063 - 1076
  • [17] Karpathy A, 2015, PROC CVPR IEEE, P3128, DOI 10.1109/CVPR.2015.7298932
  • [18] Imageability- and Length-Controllable Image Captioning
    Kastner, Marc A.
    Umemura, Kazuki
    Ide, Ichiro
    Kawanishi, Yasutomo
    Hirayama, Takatsugu
    Doman, Keisuke
    Deguchi, Daisuke
    Murase, Hiroshi
    Satoh, Shin'Ichi
    [J]. IEEE ACCESS, 2021, 9 (09) : 162951 - 162961
  • [19] Kingma DP, 2014, ADV NEUR IN, V27
  • [20] BabyTalk: Understanding and Generating Simple Image Descriptions
    Kulkarni, Girish
    Premraj, Visruth
    Ordonez, Vicente
    Dhar, Sagnik
    Li, Siming
    Choi, Yejin
    Berg, Alexander C.
    Berg, Tamara L.
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2013, 35 (12) : 2891 - 2903