Double-Stream Position Learning Transformer Network for Image Captioning

被引:32
|
作者
Jiang, Weitao [1 ]
Zhou, Wei [1 ]
Hu, Haifeng [1 ]
机构
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
关键词
Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism;
D O I
10.1109/TCSVT.2022.3181490
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
Image captioning has made significant achievement through developing feature extractor and model architecture. Recently, the image region features extracted by object detector prevail in most existing models. However, region features are criticized for the lacking of background and full contextual information. This problem can be remedied by providing some complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) which exploits the advantages of region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with Relative Position Learning (RPL) module to enhance the representations of region features through modeling the relationships between regions and positions respectively. As for the patch-stream encoder, we introduce convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves the ability of relationship modeling by combining the position and visual content of patches. Incorporating CPL into the Transformer encoder can synthesize the benefits of convolution in local relation modeling and self-attention in global feature fusion, thereby compensating for the information loss caused by the flattening operation of 2D feature maps to 1D patches. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contribution of enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
引用
收藏
页码:7706 / 7718
页数:13
相关论文
共 50 条
  • [1] Interactive Concept Network Enhanced Transformer for Remote Sensing Image Captioning
    Zhang, Cheng
    Ren, Zhongle
    Hou, Biao
    Meng, Jianhua
    Li, Weibin
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [2] Noise Augmented Double-Stream Graph Convolutional Networks for Image Captioning
    Wu, Lingxiang
    Xu, Min
    Sang, Lei
    Yao, Ting
    Mei, Tao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (08) : 3118 - 3127
  • [3] Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning
    Kandala, Hitesh
    Saha, Sudipan
    Banerjee, Biplab
    Zhu, Xiao Xiang
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [4] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    BIG DATA, 2022, 10 (06) : 515 - 527
  • [5] Prior Knowledge-Guided Transformer for Remote Sensing Image Captioning
    Meng, Lingwu
    Wang, Jing
    Yang, Yang
    Xiao, Liang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61 : 1 - 13
  • [6] Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer
    Zhuang, Shuo
    Wang, Ping
    Wang, Gang
    Wang, Di
    Chen, Jinyong
    Gao, Feng
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [7] Triple-Stream Commonsense Circulation Transformer Network for Image Captioning
    Li, Jianchao
    Zhou, Wei
    Wang, Kai
    Hu, Haifeng
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 249
  • [8] Cascade Semantic Prompt Alignment Network for Image Captioning
    Li, Jingyu
    Zhang, Lei
    Zhang, Kun
    Hu, Bo
    Xie, Hongtao
    Mao, Zhendong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5266 - 5281
  • [9] Multi-Gate Attention Network for Image Captioning
    Jiang, Weitao
    Li, Xiying
    Hu, Haifeng
    Lu, Qiang
    Liu, Bohong
    IEEE ACCESS, 2021, 9 : 69700 - 69709
  • [10] A Position-Aware Transformer for Image Captioning
    Deng, Zelin
    Zhou, Bo
    He, Pei
    Huang, Jianfeng
    Alfarraj, Osama
    Tolba, Amr
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01): : 2065 - 2081