Double-Stream Position Learning Transformer Network for Image Captioning

Cited: 32
Authors
Jiang, Weitao [1 ]
Zhou, Wei [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
Keywords
Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism;
DOI
10.1109/TCSVT.2022.3181490
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Subject Classification
0808 ; 0809 ;
Abstract
Image captioning has achieved significant progress through the development of feature extractors and model architectures. Recently, image region features extracted by object detectors have prevailed in most existing models. However, region features are criticized for lacking background and full contextual information. This problem can be remedied by providing complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) that exploits the advantages of both region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and positions, respectively. For the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves relationship modeling by combining the positions and visual content of patches. Incorporating CPL into the Transformer encoder synthesizes the benefits of convolution in local relation modeling and of self-attention in global feature fusion, thereby compensating for the information loss caused by flattening 2D feature maps into 1D patch sequences. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
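The Adaptive Fusion Attention mechanism described in the abstract balances two feature streams. A minimal sketch of one plausible gating formulation is shown below; the sigmoid-gate design and names such as `adaptive_fusion` are assumptions for illustration, not the authors' actual implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_fusion(region_vec, patch_vec, w, b):
    """Hypothetical sketch: a learned sigmoid gate decides how much the
    region stream vs. the patch stream contributes at each position."""
    # Scalar gate computed from the concatenated features of both streams.
    concat = region_vec + patch_vec  # list concatenation -> length 2d
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    # Convex combination of the two enhanced feature streams.
    return [g * r + (1.0 - g) * p for r, p in zip(region_vec, patch_vec)]

# Toy usage: one position with 4-dimensional features per stream.
random.seed(0)
d = 4
region = [random.gauss(0, 1) for _ in range(d)]
patch = [random.gauss(0, 1) for _ in range(d)]
w = [0.1] * (2 * d)
fused = adaptive_fusion(region, patch, w, b=0.0)
print(len(fused))  # 4
```

Because the gate is a convex weight, each fused feature lies between the corresponding region and patch features, so neither stream can be entirely suppressed.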
Pages: 7706 - 7718
Page count: 13
Related Papers
50 items in total
  • [41] Vision Transformer With Contrastive Learning for Remote Sensing Image Scene Classification
    Bi, Meiqiao
    Wang, Minghua
    Li, Zhi
    Hong, Danfeng
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2023, 16 : 738 - 749
  • [42] Complementary Shifted Transformer for Image Captioning
    Liu, Yanbo
    Yang, You
    Xiang, Ruoyu
    Ma, Jixin
    NEURAL PROCESSING LETTERS, 2023, 55 (06) : 8339 - 8363
  • [43] A Review of Transformer-Based Approaches for Image Captioning
    Ondeng, Oscar
    Ouma, Heywood
    Akuon, Peter
    APPLIED SCIENCES-BASEL, 2023, 13 (19):
  • [44] Learning joint relationship attention network for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 211
  • [45] DBCTNet: Double Branch Convolution-Transformer Network for Hyperspectral Image Classification
    Xu, Rui
    Dong, Xue-Mei
    Li, Weijie
    Peng, Jiangtao
    Sun, Weiwei
    Xu, Yi
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [46] Transformer with a Parallel Decoder for Image Captioning
    Wei, Peilang
    Liu, Xu
    Luo, Jun
    Pu, Huayan
    Huang, Xiaoxu
    Wang, Shilong
    Cao, Huajun
    Yang, Shouhong
    Zhuang, Xu
    Wang, Jason
    Yue, Hong
    Ji, Cheng
    Zhou, Mingliang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (01)
  • [47] ReFormer: The Relational Transformer for Image Captioning
    Yang, Xuewen
    Liu, Yingru
    Wang, Xin
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5398 - 5406
  • [48] ETransCap: efficient transformer for image captioning
    Mundu, Albert
    Singh, Satish Kumar
    Dubey, Shiv Ram
    APPLIED INTELLIGENCE, 2024, 54 (21) : 10748 - 10762
  • [49] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
    PATTERN RECOGNITION LETTERS, 2021, 143 : 43 - 49