Exploring Spatial-Based Position Encoding for Image Captioning

Cited by: 2
Authors
Yang, Xiaobao [1 ,2 ]
He, Shuai [2 ]
Wu, Junsheng [3 ]
Yang, Yang [2 ]
Hou, Zhiqiang [2 ]
Ma, Sugang [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Xian Univ Posts & Telecommun, Sch Comp Sci & Technol, Xian 710061, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710072, Peoples R China
Keywords
position encoding; image captioning; transformer
DOI
10.3390/math11214550
Chinese Library Classification
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
Image captioning has become a hot topic in artificial intelligence research and sits at the intersection of computer vision and natural language processing. Most recent image captioning models adopt an "encoder + decoder" architecture, in which the encoder extracts the visual features and the decoder generates the descriptive sentence word by word. However, the visual features must be flattened into a sequence before being forwarded to the decoder, which loses the 2D spatial position information of the image. This limitation is particularly pronounced in the Transformer architecture, since it is inherently not position-aware. Therefore, in this paper, we propose a simple coordinate-based spatial position encoding method (CSPE) to remedy this deficiency. CSPE first creates 2D position coordinates for each feature pixel, and then encodes them by row and by column separately via trainable or hard encoding, strengthening the position representation of the visual features and enriching the generated descriptions. In addition, to reduce the time cost, we also explore a diagonal-based spatial position encoding (DSPE) approach. Compared with CSPE, DSPE is slightly inferior in performance but faster to compute. Extensive experiments on the MS COCO 2014 dataset demonstrate that CSPE and DSPE can significantly enhance the spatial position representation of visual features. CSPE, in particular, improves BLEU-4 and CIDEr by 1.6% and 5.7%, respectively, compared with a baseline model without sequence-based position encoding, and also outperforms current sequence-based position encoding approaches by a significant margin. In addition, the robustness and plug-and-play ability of the proposed method are validated on a medical captioning generation model.
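The row-and-column idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the "hard" variant is a fixed sinusoidal code, that the row and column codes each take half of the channels and are concatenated, and that the feature map is flattened in row-major order. The function names (sinusoid_1d, coordinate_position_encoding, add_position) are hypothetical.

```python
import math


def sinusoid_1d(position, dim):
    """Fixed sinusoidal code of length `dim` for one integer position.

    Assumption: this stands in for the "hard" encoding named in the abstract.
    """
    code = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        code.append(math.sin(position * freq))
        code.append(math.cos(position * freq))
    return code[:dim]  # truncate in case dim is odd


def coordinate_position_encoding(height, width, d_model):
    """Return height*width vectors of length d_model, in row-major order.

    Assumption: the first half of each vector encodes the row index and the
    second half the column index; the paper may combine them differently.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    half = d_model // 2
    # Encode rows and columns separately, once per distinct coordinate.
    row_codes = [sinusoid_1d(r, half) for r in range(height)]
    col_codes = [sinusoid_1d(c, half) for c in range(width)]
    # Each grid cell (r, c) gets its row code followed by its column code.
    return [row_codes[r] + col_codes[c]
            for r in range(height) for c in range(width)]


def add_position(features, encoding):
    """Element-wise sum of flattened features and their position codes."""
    return [[f + p for f, p in zip(fv, pv)]
            for fv, pv in zip(features, encoding)]


if __name__ == "__main__":
    pe = coordinate_position_encoding(height=3, width=4, d_model=8)
    print(len(pe), len(pe[0]))
```

Unlike a 1D sequence index, this code keeps two cells in the same row identical in the row half of their vectors, so vertical and horizontal neighbours remain distinguishable after flattening. A trainable variant would replace sinusoid_1d with two learned lookup tables, one indexed by row and one by column.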
Pages: 16