Fine-grained and Semantic-guided Visual Attention for Image Captioning

Cited by: 9
Authors
Zhang, Zongjian [1]
Wu, Qiang [1]
Wang, Yang [2]
Chen, Fang [2]
Affiliations
[1] Univ Technol Sydney, Sydney, NSW, Australia
[2] CSIRO, Data61, Eveleigh, NSW, Australia
DOI
10.1109/WACV.2018.00190
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Soft attention is regarded as one of the representative methods for image captioning. Built on the end-to-end CNN-LSTM framework, it was the first attempt to link the relevant visual information in the image with the semantic representation in the text (i.e., the caption). In recent years, several state-of-the-art methods motivated by this approach have been published, and they include more elegant fine-tuning operations. However, due to the constraints of the CNN architecture, the given image is only segmented into a fixed-resolution grid at a coarse level. The overall visual feature created for each grid cell indiscriminately fuses all the objects, or portions of objects, inside it. There is no semantic link among grid cells, although an object may be split across different grid cells. In addition, large-area stuff (e.g., sky and beach) cannot be represented by the current methods. To tackle these problems, this paper proposes a new model based on the FCN-LSTM framework that can segment the input image into a fine-grained grid. Moreover, the visual feature representing each grid cell is contributed only by the principal object, or its portion, in the corresponding cell. By adopting pixel-wise labels (i.e., semantic segmentation), the visual representations of different grid cells are correlated with each other. In this way, a mechanism of fine-grained and semantic-guided visual attention is created, which can better link the relevant visual information with each semantic meaning in the text through the LSTM. Without using elegant fine-tuning, comprehensive experiments show consistently promising performance across different evaluation metrics.
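The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`principal_label_pool`, `soft_attention`), the additive scoring form, and the parameters `W_f`, `W_h`, and `v` are assumptions made for illustration, and the per-pixel features and labels stand in for the outputs of an FCN. Each grid cell is pooled only over the pixels that carry the cell's most frequent semantic label, and a soft-attention step then weights the cells given a decoder hidden state.

```python
import numpy as np


def principal_label_pool(features, labels, cell):
    """Pool per-pixel features into grid cells, guided by semantic labels.

    features: (H, W, C) float array of per-pixel features (assumed FCN output).
    labels:   (H, W) non-negative int array of per-pixel semantic labels.
    cell:     side length of a square grid cell, in pixels.

    Returns (cell_feats, cell_labels) with shapes (G, C) and (G,), where each
    cell feature averages only the pixels of that cell's principal label.
    """
    H, W, C = features.shape
    feats, labs = [], []
    for y in range(0, H, cell):
        for x in range(0, W, cell):
            f = features[y:y + cell, x:x + cell].reshape(-1, C)
            l = labels[y:y + cell, x:x + cell].reshape(-1)
            # Most frequent label in the cell; ties go to the lowest label id.
            principal = np.bincount(l).argmax()
            feats.append(f[l == principal].mean(axis=0))
            labs.append(principal)
    return np.stack(feats), np.array(labs)


def soft_attention(cell_feats, hidden, W_f, W_h, v):
    """Additive soft attention over grid cells (hypothetical parameterization).

    cell_feats: (G, C), hidden: (D,), W_f: (C, A), W_h: (D, A), v: (A,).
    Returns (weights, context) with shapes (G,) and (C,).
    """
    scores = np.tanh(cell_feats @ W_f + hidden @ W_h) @ v
    scores = scores - scores.max()  # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ cell_feats  # attention-weighted sum of cell features
    return weights, context
```

Cells that share a principal label can be treated as parts of the same object or stuff region, which is one simple way to read the abstract's statement that pixel-wise labels correlate the representations of different grid cells.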
Pages: 1709-1717
Page count: 9