Caption TLSTMs: combining transformer with LSTMs for image captioning

Cited by: 0

Authors:

Jie Yan

Yuxiang Xie

Xidao Luan

Yanming Guo

Quanzhi Gong

Suru Feng

Affiliations:

[1] National University of Defense Technology, College of Systems Engineering

[2] Changsha University, College of Computer Engineering and Applied Mathematics

Source:

International Journal of Multimedia Information Retrieval | 2022 / Vol. 11

Keywords:

Image captioning; Transformer; Abstract scene graph; Deep learning

DOI:

Not available

Chinese Library Classification (CLC) number:

Subject classification number:

Abstract:

Image captioning has attracted widespread attention over the years. Recurrent neural networks (RNNs) and their variants have long been the mainstream approach to the image captioning task. However, transformer-based models have shown powerful and promising performance on visual tasks compared with classic neural networks. To extract richer and more robust multimodal intersection feature representations, we improve the original abstract scene graph to caption model and propose Caption TLSTMs, which consists of two LSTMs with Transformer blocks between them. Compared with the original model, the Caption TLSTMs architecture lets the entire network make the most of the LSTM's ability to model long-term dependencies and represent features, while also encoding the multimodal textual, visual and graphic information with the Transformer blocks. Finally, experiments on the VisualGenome and MSCOCO datasets show good performance in improving general image caption generation quality, demonstrating the effectiveness of the Caption TLSTMs model.

Citation

Pages: 111-121

Page count: 10

