Exploring Spatial-Based Position Encoding for Image Captioning

Cited by: 2
Authors
Yang, Xiaobao [1 ,2 ]
He, Shuai [2 ]
Wu, Junsheng [3 ]
Yang, Yang [2 ]
Hou, Zhiqiang [2 ]
Ma, Sugang [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Xian Univ Posts & Telecommun, Sch Comp Sci & Technol, Xian 710061, Peoples R China
[3] Northwestern Polytech Univ, Sch Software, Xian 710072, Peoples R China
Keywords
position encoding; image captioning; transformer
DOI
10.3390/math11214550
Chinese Library Classification
O1 [Mathematics]
Discipline Classification Codes
0701; 070101
Abstract
Image captioning has become a hot topic in artificial intelligence research and sits at the intersection of computer vision and natural language processing. Most recent image captioning models adopt an "encoder + decoder" architecture, in which the encoder is generally employed to extract visual features while the decoder generates the descriptive sentence word by word. However, the visual features must be flattened into sequence form before being forwarded to the decoder, which results in the loss of the image's 2D spatial position information. This limitation is particularly pronounced in the Transformer architecture, since it is inherently not position-aware. Therefore, in this paper, we propose a simple coordinate-based spatial position encoding method (CSPE) to remedy this deficiency. CSPE first creates 2D position coordinates for each feature pixel and then encodes them by row and by column separately via trainable or hard encoding, effectively strengthening the position representation of visual features and enriching the generated description sentences. In addition, to reduce the time cost, we also explore a diagonal-based spatial position encoding (DSPE) approach. Compared with CSPE, DSPE is slightly inferior in performance but faster to compute. Extensive experiments on the MS COCO 2014 dataset demonstrate that CSPE and DSPE significantly enhance the spatial position representation of visual features. In particular, CSPE improves the BLEU-4 and CIDEr metrics by 1.6% and 5.7%, respectively, over a baseline model without sequence-based position encoding, and also outperforms current sequence-based position encoding approaches by a significant margin. Finally, the robustness and plug-and-play ability of the proposed method are validated on a medical caption generation model.
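The row-and-column scheme the abstract describes can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's actual CSPE implementation: it uses the standard sinusoidal encoding as the "hard" variant, devotes half of the channels to the row coordinate and half to the column coordinate, and then flattens the H x W grid into the sequence form a Transformer decoder consumes.

```python
import numpy as np

def sinusoidal_pe(length: int, dim: int) -> np.ndarray:
    """Standard 1D sinusoidal position encoding (a common 'hard' encoding)."""
    pos = np.arange(length)[:, None]            # (length, 1)
    i = np.arange(dim // 2)[None, :]            # (1, dim/2)
    angles = pos / (10000 ** (2 * i / dim))
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def spatial_pe_2d(height: int, width: int, dim: int) -> np.ndarray:
    """Row/column spatial position encoding for an H x W feature grid.

    Half of the channels encode the row coordinate and half the column
    coordinate, mirroring the separate row/column encoding described in
    the abstract (the exact CSPE formulation may differ).
    """
    half = dim // 2
    row_pe = sinusoidal_pe(height, half)        # (H, dim/2)
    col_pe = sinusoidal_pe(width, half)         # (W, dim/2)
    pe = np.zeros((height, width, dim))
    pe[:, :, :half] = row_pe[:, None, :]        # broadcast over columns
    pe[:, :, half:] = col_pe[None, :, :]        # broadcast over rows
    # Flatten to sequence form, as done before feeding the decoder.
    return pe.reshape(height * width, dim)

grid_pe = spatial_pe_2d(7, 7, 512)
print(grid_pe.shape)  # (49, 512)
```

Unlike a purely sequence-based encoding of the flattened index 0..48, two pixels in the same row here share identical row channels, so the 2D neighborhood structure survives the flattening.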
Pages: 16