Double-Stream Position Learning Transformer Network for Image Captioning

Cited by: 32
Authors
Jiang, Weitao [1 ]
Zhou, Wei [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
Keywords
Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism
DOI
10.1109/TCSVT.2022.3181490
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
Image captioning has achieved significant progress through the development of feature extractors and model architectures. Recently, image region features extracted by object detectors have prevailed in most existing models. However, region features are criticized for lacking background and full contextual information. This problem can be remedied by providing complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) which exploits the advantages of both region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and their positions. As for the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the positional relationships between patches. CPL improves relationship modeling by combining the positions and visual content of patches. Incorporating CPL into the Transformer encoder synthesizes the benefits of convolution in local relation modeling and of self-attention in global feature fusion, thereby compensating for the information loss caused by flattening 2D feature maps into 1D patch sequences. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
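The Adaptive Fusion Attention described in the abstract balances two parallel feature streams. A minimal NumPy sketch of that idea, assuming a feature-wise sigmoid gate over the concatenated streams; the function name, shapes, and gating form below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def adaptive_fusion(region_feats, patch_feats, W_g, b_g):
    """Hypothetical sketch of an adaptive fusion gate:
    a learned sigmoid gate weighs the region stream against the
    patch stream, feature-wise.
      region_feats, patch_feats: (n, d) enhanced stream features
      W_g: (2*d, d), b_g: (d,) gate parameters
    Returns the fused (n, d) representation."""
    concat = np.concatenate([region_feats, patch_feats], axis=-1)   # (n, 2d)
    gate = 1.0 / (1.0 + np.exp(-(concat @ W_g + b_g)))              # (n, d), in (0, 1)
    # Convex, per-feature combination of the two streams.
    return gate * region_feats + (1.0 - gate) * patch_feats

rng = np.random.default_rng(0)
n, d = 4, 8
r = rng.standard_normal((n, d))   # stand-in for enhanced region features
p = rng.standard_normal((n, d))   # stand-in for enhanced patch features
fused = adaptive_fusion(r, p, 0.1 * rng.standard_normal((2 * d, d)), np.zeros(d))
```

Because the gate lies in (0, 1), each fused value is a convex combination of the corresponding region and patch values, so neither stream can be entirely discarded.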
Pages: 7706-7718 (13 pages)