Complementary Shifted Transformer for Image Captioning

Cited by: 1

Authors:

Liu, Yanbo [1]

Yang, You [2]

Xiang, Ruoyu [1]

Ma, Jixin [1]

Affiliations:

[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China

[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2023 / Vol. 55 / Issue 06

Keywords:

Image captioning; Transformer; Positional encoding; Multi-branch self-attention; Spatial shift

DOI:

10.1007/s11063-023-11314-0

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Transformer-based models have dominated many vision and language tasks, including image captioning. However, such models still suffer from limited expressive ability and from information loss during dimensionality reduction. To address these problems, this paper proposes a Complementary Shifted Transformer (CST) for image captioning. We first introduce a complementary Multi-branch Bi-positional encoding Self-Attention (MBSA) module, which uses both absolute and relative positional encoding to learn precise positional representations. MBSA is also equipped with a Multi-Branch Architecture, which replicates multiple branches for each head. To improve the expressive ability of the model, we use the drop branch technique, which trains the branches in a complementary way. Furthermore, we propose a Spatial Shift Augmented module, which takes advantage of both low-level and high-level features to enhance visual features with fewer parameters. To validate our model, we conduct extensive experiments on the MSCOCO benchmark dataset. Compared to state-of-the-art methods, the proposed CST achieves a competitive performance of 135.3% CIDEr (+0.2%) on the Karpathy split and 136.3% CIDEr (+0.9%) on the official online test server. We also evaluate the inference performance of our model on a novel object dataset. The source code and trained models are publicly available at https://github.com/noonisy/CST.
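The abstract names a drop branch technique for the multi-branch attention heads but does not spell out its formulation. The following is a minimal sketch of one common way such a scheme can work, under stated assumptions: each head has several branch outputs, a random subset is dropped during training, at least one branch is always kept, and the surviving outputs are averaged. The function name `drop_branch_combine`, the `drop_prob` parameter, and the averaging rule are all hypothetical and are not taken from the paper or its repository.

```python
import random


def drop_branch_combine(branch_outputs, drop_prob=0.3, training=True, rng=None):
    """Combine per-branch output vectors of one head by averaging.

    Assumption (not from the paper): in training, each branch is dropped
    independently with probability drop_prob, and one branch is forced to
    survive if all would be dropped. At inference, all branches are kept.
    """
    if rng is None:
        rng = random
    n = len(branch_outputs)
    if training:
        # Decide independently for every branch whether it survives.
        keep = [rng.random() >= drop_prob for _ in range(n)]
        if not any(keep):
            # Guarantee that at least one branch contributes.
            keep[rng.randrange(n)] = True
    else:
        keep = [True] * n
    kept = [out for out, k in zip(branch_outputs, keep) if k]
    dim = len(branch_outputs[0])
    # Element-wise mean over the surviving branches.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two branches of one head, each producing a 2-dimensional output.
branches = [[1.0, 2.0], [3.0, 4.0]]
print(drop_branch_combine(branches, training=False))
```

Because each training step sees a different subset of branches, no single branch can rely on the others, which is one plausible reading of "trains the branches in a complementary way". The released code at the repository linked in the abstract is the authoritative reference for the actual method.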


Pages: 8339-8363

Page count: 25

Total: 50 items

[1] Complementary Shifted Transformer for Image Captioning
Liu, Yanbo
Yang, You
Xiang, Ruoyu
Ma, Jixin
NEURAL PROCESSING LETTERS, 2023, 55 : 8339 - 8363
[2] Rotary Transformer for Image Captioning
Qiu, Yile
Zhu, Li
SECOND INTERNATIONAL CONFERENCE ON OPTICS AND IMAGE PROCESSING (ICOIP 2022), 2022, 12328
[3] Boosted Transformer for Image Captioning
Li, Jiangyun
Yao, Peng
Guo, Longteng
Zhang, Weicun
APPLIED SCIENCES-BASEL, 2019, 9 (16)
[4] Transformer with a Parallel Decoder for Image Captioning
Wei, Peilang
Liu, Xu
Luo, Jun
Pu, Huayan
Huang, Xiaoxu
Wang, Shilong
Cao, Huajun
Yang, Shouhong
Zhuang, Xu
Wang, Jason
Yue, Hong
Ji, Cheng
Zhou, Mingliang
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (01)
[5] Image captioning with transformer and knowledge graph
Zhang, Yu
Shi, Xinyu
Mi, Siya
Yang, Xu
PATTERN RECOGNITION LETTERS, 2021, 143 (143) : 43 - 49
[6] Recurrent fusion transformer for image captioning
Mou, Zhenping
Yuan, Qiao
Song, Tianqi
SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
[7] Distance Transformer for Image Captioning
Wang, Jiarong
Lu, Tongwei
Liu, Xuanxuan
Yang, Qi
2021 4TH INTERNATIONAL CONFERENCE ON ROBOTICS, CONTROL AND AUTOMATION ENGINEERING (RCAE 2021), 2021 : 73 - 76
[8] Context-aware transformer for image captioning
Yang, Xin
Wang, Ying
Chen, Haishun
Li, Jie
Huang, Tingting
NEUROCOMPUTING, 2023, 549
[9] A Position-Aware Transformer for Image Captioning
Deng, Zelin
Zhou, Bo
He, Pei
Huang, Jianfeng
Alfarraj, Osama
Tolba, Amr
CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01) : 2065 - 2081
[10] Full-Memory Transformer for Image Captioning
Lu, Tongwei
Wang, Jiarong
Min, Fen
SYMMETRY-BASEL, 2023, 15 (01)
