Double-Stream Position Learning Transformer Network for Image Captioning

Cited by: 32
Authors
Jiang, Weitao [1 ]
Zhou, Wei [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
Keywords
Transformers; Feature extraction; Visualization; Decoding; Convolutional neural networks; Task analysis; Semantics; Image captioning; transformer; convolutional position learning; attention mechanism
DOI
10.1109/TCSVT.2022.3181490
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
Image captioning has made significant progress through the development of feature extractors and model architectures. Recently, image region features extracted by object detectors have prevailed in most existing models. However, region features are criticized for lacking background and full contextual information, a problem that can be remedied by complementary visual information from patch features. In this paper, we propose a Double-Stream Position Learning Transformer Network (DSPLTN) that exploits the advantages of both region features and patch features. Specifically, the region-stream encoder utilizes a Transformer encoder with a Relative Position Learning (RPL) module to enhance the representations of region features by modeling the relationships between regions and their respective positions. As for the patch-stream encoder, we introduce a convolutional neural network into the vanilla Transformer encoder and propose a novel Convolutional Position Learning (CPL) module to encode the position relationships between patches. CPL improves relationship modeling by combining the positions and visual content of patches. Incorporating CPL into the Transformer encoder combines the benefits of convolution in local relation modeling with those of self-attention in global feature fusion, thereby compensating for the information lost when 2D feature maps are flattened into 1D patch sequences. Furthermore, an Adaptive Fusion Attention (AFA) mechanism is proposed to balance the contributions of the enhanced region and patch features. Extensive experiments on MSCOCO demonstrate the effectiveness of the double-stream encoder and CPL, and show the superior performance of DSPLTN.
Pages: 7706-7718
Page count: 13
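
The abstract outlines two mechanisms without giving their formulas: CPL, which injects content-conditioned position information into flattened patch features, and AFA, which balances the two encoder streams. The PyTorch sketch below illustrates one plausible reading of each. The depthwise-convolution form of CPL, the sigmoid-gate form of AFA, and all module and parameter names (ConvolutionalPositionLearning, AdaptiveFusionAttention, grid_size, etc.) are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn

class ConvolutionalPositionLearning(nn.Module):
    # Assumed form of CPL: re-fold the 1D patch sequence back onto its 2D
    # grid and apply a depthwise convolution, so the position signal depends
    # on the visual content of neighboring patches.
    def __init__(self, dim, grid_size, kernel_size=3):
        super().__init__()
        self.grid_size = grid_size  # (H, W) of the patch grid
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (B, N, D) with N == H * W
        b, n, d = x.shape
        h, w = self.grid_size
        grid = x.transpose(1, 2).reshape(b, d, h, w)
        pos = self.dwconv(grid).flatten(2).transpose(1, 2)
        return x + pos                         # content-conditioned positions

class AdaptiveFusionAttention(nn.Module):
    # Assumed form of AFA: a learned sigmoid gate that weighs the
    # region-stream context against the patch-stream context.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, region_ctx, patch_ctx):  # both: (B, D)
        g = torch.sigmoid(self.gate(torch.cat([region_ctx, patch_ctx], -1)))
        return g * region_ctx + (1.0 - g) * patch_ctx

# Usage sketch: a 7x7 grid of 512-d patch features and a pooled region vector.
patches = torch.randn(2, 49, 512)
cpl = ConvolutionalPositionLearning(dim=512, grid_size=(7, 7))
patch_ctx = cpl(patches).mean(dim=1)           # stand-in for patch attention
region_ctx = torch.randn(2, 512)               # stand-in for region attention
fused = AdaptiveFusionAttention(512)(region_ctx, patch_ctx)
print(fused.shape)                             # torch.Size([2, 512])

Under these assumptions, the gate g lets the model lean on region context for object-centric words and on patch context for background or scene words, which is one way to realize the balancing the abstract motivates.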