Exploring Spatial-Based Position Encoding for Image Captioning

Cited: 2
Authors
Yang, Xiaobao [1 ,2 ]
He, Shuai [2 ]
Wu, Junsheng [3 ]
Yang, Yang [2 ]
Hou, Zhiqiang [2 ]
Ma, Sugang [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Xian Univ Posts & Telecommun, Sch Comp Sci & Technol, Xian 710061, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710072, Peoples R China
Keywords
position encoding; image captioning; transformer;
DOI
10.3390/math11214550
Chinese Library Classification
O1 [Mathematics];
Discipline Codes
0701 ; 070101 ;
Abstract
Image captioning has become a hot topic in artificial intelligence research and sits at the intersection of computer vision and natural language processing. Most recent image captioning models have adopted an "encoder + decoder" architecture, in which the encoder is generally employed to extract visual features, while the decoder generates the descriptive sentence word by word. However, the visual features need to be flattened into sequence form before being forwarded to the decoder, and this results in the loss of the 2D spatial position information of the image. This limitation is particularly pronounced in the Transformer architecture, since it is inherently not position-aware. Therefore, in this paper, we propose a simple coordinate-based spatial position encoding method (CSPE) to remedy this deficiency. CSPE first creates the 2D position coordinates for each feature pixel and then encodes them by row and by column separately, via trainable or hard encoding, effectively strengthening the position representation of visual features and enriching the generated description sentences. In addition, in order to reduce the time cost, we also explore a diagonal-based spatial position encoding (DSPE) approach. Compared with CSPE, DSPE is slightly inferior in performance but has a faster calculation speed. Extensive experiments on the MS COCO 2014 dataset demonstrate that CSPE and DSPE can significantly enhance the spatial position representation of visual features. CSPE, in particular, improves the BLEU-4 and CIDEr metrics by 1.6% and 5.7%, respectively, compared with a baseline model without sequence-based position encoding, and also outperforms current sequence-based position encoding approaches by a significant margin. In addition, the robustness and plug-and-play ability of the proposed method are validated on a medical caption generation model.
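The abstract describes CSPE as creating 2D coordinates for each feature pixel and encoding the row and column coordinates separately. The paper's exact formulation is not given here, so the following is only a minimal sketch of that idea, assuming the "hard" variant uses fixed sinusoidal encodings (as in the original Transformer) with half of the channel dimension devoted to the row coordinate and half to the column coordinate; the function names and the concatenation scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Fixed ("hard") sinusoidal encoding for a 1D coordinate vector."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000 ** (2 * i / dim))          # (dim/2,)
    angles = positions[:, None] * freqs[None, :]    # (N, dim/2)
    enc = np.zeros((len(positions), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def cspe(height, width, dim):
    """Sketch of coordinate-based spatial position encoding (CSPE):
    encode the row and column coordinates of each feature pixel
    separately, then concatenate the two halves per pixel.
    The split of `dim` into equal row/column halves is an assumption."""
    assert dim % 2 == 0
    rows = sinusoidal_encoding(np.arange(height, dtype=float), dim // 2)  # (H, dim/2)
    cols = sinusoidal_encoding(np.arange(width, dtype=float), dim // 2)   # (W, dim/2)
    # Broadcast to an (H, W, dim) grid: row half followed by column half.
    grid = np.concatenate(
        [np.repeat(rows[:, None, :], width, axis=1),
         np.repeat(cols[None, :, :], height, axis=0)],
        axis=-1)
    return grid

# The resulting grid is added to the flattened visual features, so two
# pixels in the same row share their row half and differ in their column
# half -- unlike 1D sequence-based encodings, which lose this structure.
pe = cspe(7, 7, 512)
```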
Pages: 16
Related Papers
50 records total
  • [31] DEEP CONTEXT-ENCODING NETWORK FOR RETINAL IMAGE CAPTIONING
    Huang, Jia-Hong
    Wu, Ting-Wei
    Yang, Chao-Han Huck
    Worring, Marcel
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 3762 - 3766
  • [32] EXPLORING DUAL STREAM GLOBAL INFORMATION FOR IMAGE CAPTIONING
    Xian, Tiantao
    Li, Zhixin
    Chen, Tianyu
    Ma, Huifang
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4458 - 4462
  • [33] Exploring region features in remote sensing image captioning
    Zhao, Kai
    Xiong, Wei
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 127
  • [34] Exploring Data and Models in SAR Ship Image Captioning
    Zhao, Kai
    Xiong, Wei
    IEEE ACCESS, 2022, 10 : 91150 - 91159
  • [35] Exploring the Impact of Vision Features in News Image Captioning
    Zhang, Junzhe
    Wan, Xiaojun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 12923 - 12936
  • [36] Spatial-based modeling of tactical communication networks
    Li J.
    Lü X.
    Tan Y.-J.
    Xi Tong Gong Cheng Yu Dian Zi Ji Shu/Systems Engineering and Electronics, 2010, 32 (07): : 1456 - 1461
  • [37] Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning
    Li, Jingyu
    Mao, Zhendong
    Li, Hao
    Chen, Weidong
    Zhang, Yongdong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [38] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [39] PCATNet: Position-Class Awareness Transformer for Image Captioning
    Tang, Ziwei
    Yi, Yaohua
    Yu, Changhui
    Yin, Aiguo
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (03): : 6007 - 6022
  • [40] A Multiobjective Spatial-based Zone Design Model (MoSZoD)
    Bong, C.W. (cwbong@fit.unimas.my), (Marine Technology Society Inc.):