Double-Stream Position Learning Transformer Network for Image Captioning

Cited by: 32
Authors
Jiang, Weitao [1 ]
Zhou, Wei [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
Keywords
Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism
DOI
10.1109/TCSVT.2022.3181490
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Image captioning has made significant progress through the development of feature extractors and model architectures. Recently, image region features extracted by object detectors have prevailed in most existing models. However, region features are criticized for lacking background and full contextual information, a problem that can be remedied by complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) that exploits the advantages of both region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and their respective positions. As for the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves relationship modeling by combining the positions and visual content of patches. Incorporating CPL into the Transformer encoder combines the benefits of convolution in local relation modeling with those of self-attention in global feature fusion, thereby compensating for the information lost when 2D feature maps are flattened into 1D patch sequences. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
Pages: 7706-7718
Page count: 13
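
The abstract outlines two mechanisms without giving their formulas: CPL, which injects content-conditioned position information into flattened patch features, and AFA, which balances the two encoder streams. The PyTorch sketch below illustrates one plausible reading of each. The depthwise-convolution form of CPL, the sigmoid-gate form of AFA, and all module and parameter names (ConvolutionalPositionLearning, AdaptiveFusionAttention, grid_size, etc.) are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class ConvolutionalPositionLearning(nn.Module):
    # Assumed form of CPL: re-fold the 1D patch sequence back onto its 2D
    # grid and apply a depthwise convolution, so the position signal depends
    # on the visual content of neighboring patches.
    def __init__(self, dim, grid_size, kernel_size=3):
        super().__init__()
        self.grid_size = grid_size  # (H, W) of the patch grid
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (B, N, D) with N == H * W
        b, n, d = x.shape
        h, w = self.grid_size
        grid = x.transpose(1, 2).reshape(b, d, h, w)
        pos = self.dwconv(grid).flatten(2).transpose(1, 2)
        return x + pos                         # content-conditioned positions

class AdaptiveFusionAttention(nn.Module):
    # Assumed form of AFA: a learned sigmoid gate that weighs the
    # region-stream context against the patch-stream context.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_ctx, patch_ctx):  # both: (B, D)
        g = torch.sigmoid(self.gate(torch.cat([region_ctx, patch_ctx], -1)))
        return g * region_ctx + (1.0 - g) * patch_ctx

# Usage sketch: a 7x7 grid of 512-d patch features and a pooled region vector.
patches = torch.randn(2, 49, 512)
cpl = ConvolutionalPositionLearning(dim=512, grid_size=(7, 7))
patch_ctx = cpl(patches).mean(dim=1)           # stand-in for patch attention
region_ctx = torch.randn(2, 512)               # stand-in for region attention
fused = AdaptiveFusionAttention(512)(region_ctx, patch_ctx)
print(fused.shape)                             # torch.Size([2, 512])

Under these assumptions, the gate g lets the model lean on region context for object-centric words and on patch context for background or scene words, which is one way to realize the balancing the abstract motivates.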