Caption TLSTMs: combining transformer with LSTMs for image captioning

被引:8
作者
Yan, Jie [1 ]
Xie, Yuxiang [1 ]
Luan, Xidao [2 ]
Guo, Yanming [1 ]
Gong, Quanzhi [1 ]
Feng, Suru [1 ]
机构
[1] Natl Univ Def Technol, Coll Syst Engn, Changsha 410000, Hunan, Peoples R China
[2] Changsha Univ, Coll Comp Engn & Appl Math, Changsha 410000, Hunan, Peoples R China
基金
中国国家自然科学基金;
关键词
Image captioning; Transformer; Abstract scene graph; Deep learning; LANGUAGE;
D O I
10.1007/s13735-022-00228-7
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Image to captions has attracted widespread attention over the years. Recurrent neural networks (RNN) and their corresponding variants have been the mainstream when it comes to dealing with image captioning task for a long time. However, transformer-based models have shown powerful and promising performance on visual tasks contrary to classic neural networks. In order to extract richer and more robust multimodal intersection feature representation, we improve the original abstract scene graph to caption model and propose the Caption TLSTMs which is made up of two LSTMs with Transformer blocks in the middle of them in this paper. Compared with the model before improvement, the architecture of our Caption TLSTMs enables the entire network to make the most of the long-term dependencies and feature representation ability of the LSTM, while encoding the multimodal textual, visual and graphic information with the transformer blocks as well. Finally, experiments on VisualGenome and MSCOCO datasets have shown good performance in improving the general image caption generation quality, demonstrating the effectiveness of the Caption TLSTMs model.
引用
收藏
页码:111 / 121
页数:11
相关论文
共 45 条
[1]   Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering [J].
Anderson, Peter ;
He, Xiaodong ;
Buehler, Chris ;
Teney, Damien ;
Johnson, Mark ;
Gould, Stephen ;
Zhang, Lei .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6077-6086
[2]  
[Anonymous], 2012, 26 AAAI C ART INT
[3]  
[Anonymous], 2012, P 13 C EUR CHAPT ASS
[4]  
Banerjee Satanjeev, 2005, P ACL WORKSH INTR EX, P65
[5]   Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs [J].
Chen, Shizhe ;
Jin, Qin ;
Wang, Peng ;
Wu, Qi .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :9959-9968
[6]   Meshed-Memory Transformer for Image Captioning [J].
Cornia, Marcella ;
Stefanini, Matteo ;
Baraldi, Lorenzo ;
Cucchiara, Rita .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :10575-10584
[7]  
Dai ZH, 2019, 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), P2978
[8]  
Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
[9]   Every Picture Tells a Story: Generating Sentences from Images [J].
Farhadi, Ali ;
Hejrati, Mohsen ;
Sadeghi, Mohammad Amin ;
Young, Peter ;
Rashtchian, Cyrus ;
Hockenmaier, Julia ;
Forsyth, David .
COMPUTER VISION-ECCV 2010, PT IV, 2010, 6314 :15-+
[10]  
Graves A, 2013, 2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), P273, DOI 10.1109/ASRU.2013.6707742