Complementary Shifted Transformer for Image Captioning

Cited by: 1
Authors
Liu, Yanbo [1 ]
Yang, You [2 ]
Xiang, Ruoyu [1 ]
Ma, Jixin [1 ]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Image captioning; Transformer; Positional encoding; Multi-branch self-attention; Spatial shift;
DOI
10.1007/s11063-023-11314-0
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based models have dominated many vision-and-language tasks, including image captioning. However, such models still suffer from limited expressive ability and from information loss during dimensionality reduction. To address these problems, this paper proposes a Complementary Shifted Transformer (CST) for image captioning. We first introduce a complementary Multi-branch Bi-positional encoding Self-Attention (MBSA) module, which uses both absolute and relative positional encoding to learn precise positional representations. MBSA is also equipped with a multi-branch architecture that replicates multiple branches for each attention head; to improve the expressive ability of the model, we train these branches in a complementary way with the drop branch technique. Furthermore, we propose a Spatial Shift Augmented module, which exploits both low-level and high-level features to enhance visual features with fewer parameters. To validate our model, we conduct extensive experiments on the MSCOCO benchmark dataset. Compared to state-of-the-art methods, the proposed CST achieves competitive performance: 135.3% CIDEr (+0.2%) on the Karpathy split and 136.3% CIDEr (+0.9%) on the official online test server. We also evaluate the inference performance of our model on a novel object dataset. The source code and trained models are publicly available at https://github.com/noonisy/CST.
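For readers unfamiliar with the spatial-shift idea the abstract refers to, the sketch below illustrates a generic, parameter-free spatial shift over grid-shaped visual features. It is a minimal illustration only, assuming PyTorch and features of shape (B, H, W, C); the function name spatial_shift and the residual-style usage are hypothetical, and this is not the authors' Spatial Shift Augmented module (see the linked repository for the official code).

# Illustrative sketch of a generic spatial-shift operation, NOT the
# authors' Spatial Shift Augmented module (official code: the CST repo).
# Assumes grid-shaped visual features of shape (B, H, W, C).
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four channel groups one step along +H, -H, +W, -W.

    Each spatial location thereby mixes in features from its four
    neighbours at zero parameter cost.
    """
    b, h, w, c = x.shape
    g = c // 4                # four equal channel groups
    out = x.clone()
    out[:, 1:, :, 0*g:1*g] = x[:, :-1, :, 0*g:1*g]   # shift down  (+H)
    out[:, :-1, :, 1*g:2*g] = x[:, 1:, :, 1*g:2*g]   # shift up    (-H)
    out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]   # shift right (+W)
    out[:, :, :-1, 3*g:4*g] = x[:, :, 1:, 3*g:4*g]   # shift left  (-W)
    return out

if __name__ == "__main__":
    feats = torch.randn(2, 7, 7, 512)        # hypothetical grid features
    augmented = feats + spatial_shift(feats)  # residual-style augmentation
    print(augmented.shape)                    # torch.Size([2, 7, 7, 512])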
Pages: 8339-8363 (25 pages)
Related Papers (50 records in total)
[21] Improved image captioning with subword units training and transformer [J]. Cai Q.; Li J.; Li H.; Zuo M. High Technology Letters, 2020, 26(02):211-216
[22] Caption TLSTMs: combining transformer with LSTMs for image captioning [J]. Yan J.; Xie Y.; Luan X.; Guo Y.; Gong Q.; Feng S. International Journal of Multimedia Information Retrieval, 2022, 11:111-121
[23] ReFormer: The Relational Transformer for Image Captioning [J]. Yang X.; Liu Y.; Wang X. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022:5398-5406
[24] ETransCap: efficient transformer for image captioning [J]. Mundu A.; Singh S.K.; Dubey S.R. Applied Intelligence, 2024, 54(21):10748-10762
[25] Direction Relation Transformer for Image Captioning [J]. Song Z.; Zhou X.; Dong L.; Tan J.; Guo L. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021:5056-5064
[26] Transformer-based image captioning by leveraging sentence information [J]. Chahkandi V.; Fadaeieslam M.J.; Yaghmaee F. Journal of Electronic Imaging, 2022, 31(04)
[27] XGL-T transformer model for intelligent image captioning [J]. Sharma D.; Dhiman C.; Kumar D. Multimedia Tools and Applications, 2024, 83:4219-4240
[28] Triple-level relationship enhanced transformer for image captioning [J]. Zheng A.; Zheng S.; Bai C.; Chen D. Multimedia Systems, 2023, 29(04):1955-1966
[29] Sequential Transformer via an Outside-In Attention for image captioning [J]. Wei Y.; Wu C.; Li G.; Shi H. Engineering Applications of Artificial Intelligence, 2022, 108
[30] Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning [J]. Kandala H.; Saha S.; Banerjee B.; Zhu X.X. IEEE Geoscience and Remote Sensing Letters, 2022, 19