Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

Cited by: 12
Authors
Zhang, Huawei [1 ]
Ma, Chengbo [1 ]
Jiang, Zhanjun [1 ]
Lian, Jing [1 ]
Affiliations
[1] Lanzhou Jiaotong Univ, Elect & Informat Engn, Lanzhou 730000, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Semantics; Visualization; Data mining; Decoding; Task analysis; Logic gates; Bi-LSTM; image caption generation; semantic fusion; semantic similarity;
DOI
10.1109/ACCESS.2022.3232508
Chinese Library Classification
TP [Automation and computer technology];
Discipline code
0812;
Abstract
The image caption generation task requires expressing image content in accurate natural language. In the prevailing encoder-decoder architecture, the decoder generates words one by one in front-to-back order and therefore cannot exploit full contextual information. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure that draws on both past and subsequent information, so that the predicted description is conditioned on context cues from both directions. The visual features are fed separately into an F-LSTM decoder (forward LSTM) and a B-LSTM decoder (backward LSTM) to extract semantic information, and the two semantic outputs complement each other. Specifically, a subsidiary attention mechanism, S-Att, operates between the F-LSTM and B-LSTM: it extracts the semantic information of each decoder, aligns their hidden states according to semantic similarity, and outputs the fused semantic information. The resulting Bi-LSTM-s model captures contextual information and produces finer-grained image captions. On the MSCOCO dataset, the model improves by 9.7% over the original LSTM baseline, alleviates the inconsistency between the forward and backward semantic information generated in the same pass, and achieves a BLEU-4 score of 37.5.
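The fusion step described in the abstract can be sketched in a few lines: align each forward hidden state with the backward hidden states by semantic similarity, then combine the aligned backward context with the forward state. This is a minimal, hypothetical NumPy sketch of that idea, not the paper's actual S-Att implementation; the function name `s_att_fuse` and the cosine-similarity/softmax choices are assumptions for illustration.

```python
import numpy as np

def s_att_fuse(h_f, h_b):
    """Similarity-based attention fusion of forward and backward decoder
    hidden states (hypothetical sketch of the S-Att idea, not the paper's code).

    h_f, h_b: (T, d) hidden-state sequences from the F-LSTM and B-LSTM.
    Returns a (T, 2d) fused semantic sequence.
    """
    # Cosine similarity between every forward/backward state pair.
    nf = h_f / np.linalg.norm(h_f, axis=1, keepdims=True)
    nb = h_b / np.linalg.norm(h_b, axis=1, keepdims=True)
    sim = nf @ nb.T                                   # (T, T) similarity matrix

    # Softmax over backward states: alignment weights for each forward step.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # Backward context aligned to each forward step, then concatenated.
    ctx = w @ h_b                                     # (T, d)
    return np.concatenate([h_f, ctx], axis=1)         # (T, 2d)

T, d = 5, 8
rng = np.random.default_rng(0)
fused = s_att_fuse(rng.standard_normal((T, d)), rng.standard_normal((T, d)))
print(fused.shape)  # → (5, 16)
```

In a full decoder the fused sequence would feed a projection layer that predicts the next word, which is where the complementary forward and backward semantics would interact.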
Pages: 134-143
Page count: 10
Related papers
46 records
  • [1] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
    Anderson, Peter
    He, Xiaodong
    Buehler, Chris
    Teney, Damien
    Johnson, Mark
    Gould, Stephen
    Zhang, Lei
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6077 - 6086
  • [2] SPICE: Semantic Propositional Image Caption Evaluation
    Anderson, Peter
    Fernando, Basura
    Johnson, Mark
    Gould, Stephen
    [J]. COMPUTER VISION - ECCV 2016, PT V, 2016, 9909 : 382 - 398
  • [3] Banerjee S., 2005, P ACL WORKSH INTR EX, P65
  • [4] SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning
    Chen, Long
    Zhang, Hanwang
    Xiao, Jun
    Nie, Liqiang
    Shao, Jian
    Liu, Wei
    Chua, Tat-Seng
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 6298 - 6306
  • [5] Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN
    Chen, Tao
    Xu, Ruifeng
    He, Yulan
    Wang, Xuan
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2017, 72 : 221 - 230
  • [6] Cho K., 2014, arXiv, DOI [arXiv:1406.1078, DOI 10.48550/ARXIV.1406.1078]
  • [7] A CNN-BiLSTM based hybrid model for Indian language identification
    Das, Himanish Shekhar
    Roy, Pinki
    [J]. APPLIED ACOUSTICS, 2021, 182
  • [8] Every Picture Tells a Story: Generating Sentences from Images
    Farhadi, Ali
    Hejrati, Mohsen
    Sadeghi, Mohammad Amin
    Young, Peter
    Rashtchian, Cyrus
    Hockenmaier, Julia
    Forsyth, David
    [J]. COMPUTER VISION-ECCV 2010, PT IV, 2010, 6314 : 15 - +
  • [9] Implicit Discourse Relation Recognition via a BiLSTM-CNN Architecture With Dynamic Chunk-Based Max Pooling
    Guo, Fengyu
    He, Ruifang
    Dang, Jianwu
    [J]. IEEE ACCESS, 2019, 7 : 169281 - 169292
  • [10] Gupta A, 2012, LECT NOTES COMPUT SC, V7667, P196, DOI 10.1007/978-3-642-34500-5_24