Fine-grained and Semantic-guided Visual Attention for Image Captioning

Cited by: 9
Authors
Zhang, Zongjian [1 ]
Wu, Qiang [1 ]
Wang, Yang [2 ]
Chen, Fang [2 ]
Affiliations
[1] Univ Technol Sydney, Sydney, NSW, Australia
[2] CSIRO, Data61, Eveleigh, NSW, Australia
Keywords
DOI
10.1109/WACV.2018.00190
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Soft-attention is regarded as one of the representative methods for image captioning. Built on the end-to-end CNN-LSTM framework, it was the first to link the relevant visual information in an image with the semantic representation of the text (i.e., the caption). In recent years, several state-of-the-art methods motivated by this approach have been published, incorporating more elaborate fine-tuning operations. However, due to the constraints of the CNN architecture, the input image is segmented only into a coarse, fixed-resolution grid. The visual feature created for each grid cell indiscriminately fuses all objects and/or object portions inside it. There is no semantic link among grid cells, even though a single object may be split across several of them. In addition, large-area stuff (e.g., sky and beach) cannot be represented by current methods. To tackle these problems, this paper proposes a new model based on an FCN-LSTM framework, which segments the input image into a fine-grained grid. Moreover, the visual feature representing each grid cell is contributed only by the principal object, or portion of an object, in that cell. By adopting pixel-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with one another. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which better links the relevant visual information with each semantic element of the text through the LSTM. Without elaborate fine-tuning, comprehensive experiments show consistently promising performance across different evaluation metrics.
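The mechanism the abstract describes can be illustrated with a minimal sketch: a soft-attention step in which per-cell visual features from a fine-grained grid are scored against the LSTM hidden state, and scores are pooled within each semantic-segmentation region so that cells belonging to the same object share one attention weight. This is a simplified illustration under assumed shapes and an additive (Bahdanau-style) scoring function, not the paper's exact implementation; all names (`semantic_guided_attention`, `W_f`, `W_h`, `w_a`) are hypothetical.

```python
import numpy as np

def semantic_guided_attention(features, seg_labels, hidden, W_f, W_h, w_a):
    """Sketch of fine-grained, semantic-guided soft attention (assumed shapes).

    features  : (N, D) visual feature of each fine-grained grid cell
    seg_labels: (N,)   semantic class id per cell (from pixel-wise segmentation)
    hidden    : (H,)   current LSTM hidden state
    W_f, W_h, w_a : additive-attention parameters, (D, D), (H, D), (D,)
    """
    # Additive attention score per grid cell
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # (N,)

    # Pool scores within each semantic region so cells of the same
    # object share one weight -- the "semantic link" among grid cells
    pooled = scores.copy()
    for c in np.unique(seg_labels):
        mask = seg_labels == c
        pooled[mask] = scores[mask].mean()

    # Softmax over cells, numerically stabilized
    e = np.exp(pooled - pooled.max())
    alpha = e / e.sum()

    # Context vector fed to the LSTM at this decoding step
    context = alpha @ features                               # (D,)
    return alpha, context
```

Because scores are pooled per region, large-area "stuff" classes (e.g., sky) receive a single coherent attention weight instead of being fragmented across coarse grid cells.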
Pages: 1709 - 1717 (9 pages)
Related Papers (50 records)
  • [21] Fine-grained attention for image caption generation
    Chang, Yan-Shuo
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (03) : 2959 - 2971
  • [23] Fine-Grained Semantic Image Synthesis with Object-Attention Generative Adversarial Network
    Wang, Min
    Lang, Congyan
    Liang, Liqian
    Feng, Songhe
    Wang, Tao
    Gao, Yutong
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2021, 12 (05)
  • [24] c-RNN: A Fine-Grained Language Model for Image Captioning
    Huang, Gengshi
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 49 (02) : 683 - 691
  • [26] Fine-Grained Image Captioning With Global-Local Discriminative Objective
    Wu, Jie
    Chen, Tianshui
    Wu, Hefeng
    Yang, Zhi
    Luo, Guangchun
    Lin, Liang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2413 - 2427
  • [27] Fine-grained image emotion captioning based on Generative Adversarial Networks
    Yang, Chunmiao
    Wang, Yang
    Han, Liying
    Jia, Xiran
    Sun, Hebin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (34) : 81857 - 81875
  • [28] Wavelet and Adaptive Coordinate Attention Guided Fine-Grained Residual Network for Image Denoising
    Ding, Shifei
    Wang, Qidong
    Guo, Lili
    Li, Xuan
    Ding, Ling
    Wu, Xindong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 6156 - 6166
  • [29] Dual attention guided multi-scale CNN for fine-grained image classification
    Liu, Xiaozhang
    Zhang, Lifeng
    Li, Tao
    Wang, Dejian
    Wang, Zhaojie
    INFORMATION SCIENCES, 2021, 573 : 37 - 45
  • [30] Diversified Semantic Attention Model for Fine-Grained Entity Typing
    Hu, Yanfeng
    Qiao, Xue
    Xing, Luo
    Peng, Chen
    IEEE ACCESS, 2021, 9 : 2251 - 2265