Relational-Convergent Transformer for image captioning

Times Cited: 16
Authors
Chen, Lizhi [1 ]
Yang, You [1 ,2 ]
Hu, Juntao [1 ]
Pan, Longyue [1 ]
Zhai, Hao [1 ]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Image captioning; Relational fusion; Relational-Convergent Attention
DOI
10.1016/j.displa.2023.102377
CLC Number
TP3 [Computing technology; computer technology]
Discipline Code
0812
Abstract
Image captioning describes the visual content of a given image in natural language sentences, and plays a key role in the fusion and utilization of image features. However, in existing image captioning models, the decoder sometimes fails to efficiently capture the relationships between image features because those features lack sequential dependencies. In this paper, we propose a Relational-Convergent Transformer (RCT) network to obtain complex intramodality representations for image captioning. In RCT, a Relational Fusion Module (RFM) is designed to capture the local and global information of an image through recursive fusion. A Relational-Convergent Attention (RCA) is then proposed, composed of a self-attention mechanism and a hierarchical fusion module that aggregates global relational information to extract a more comprehensive intramodal contextual representation. To validate the effectiveness of the proposed model, extensive experiments are conducted on the MSCOCO dataset. The experimental results show that the proposed method outperforms several state-of-the-art methods.
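The abstract describes RCA as self-attention over region features combined with a fusion step that injects global relational context. The paper's exact equations are not given here, so the sketch below is a minimal, hypothetical NumPy illustration of that idea: standard scaled dot-product self-attention whose output is blended with a mean-pooled global context through a learned sigmoid gate. The gate form `wg` and the mean-pooling choice are assumptions, not the authors' published formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over n region features."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def relational_convergent_attention(x, wq, wk, wv, wg):
    """Hypothetical RCA sketch: attended region features are fused with
    a global (mean-pooled) context via a sigmoid gate (assumed form)."""
    attended = self_attention(x, wq, wk, wv)
    global_ctx = attended.mean(axis=0, keepdims=True)   # global relational info
    gate = 1.0 / (1.0 + np.exp(-(attended @ wg)))       # per-feature fusion gate
    return gate * attended + (1.0 - gate) * global_ctx  # hierarchical fusion

rng = np.random.default_rng(0)
n, d = 5, 8                                 # 5 image regions, feature dim 8
x = rng.standard_normal((n, d))
ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = relational_convergent_attention(x, *ws)
print(out.shape)  # (5, 8): one fused contextual vector per region
```

The gate lets each region feature decide how much global context to absorb; a real implementation would learn `wq, wk, wv, wg` end-to-end inside a Transformer decoder layer.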
Pages: 8