XGL-T transformer model for intelligent image captioning

Cited by: 0
Authors
Dhruv Sharma
Chhavi Dhiman
Dinesh Kumar
Affiliations
[1] Delhi Technological University
Source
Multimedia Tools and Applications | 2024 / Volume 83
Keywords
Activation Function; Attention; Computer Vision; Higher Order Interaction; Image Captioning; XGL; Transformer
DOI
Not available
Abstract
Image captioning extracts multiple semantic features from an image and integrates them into a sentence-level description. Generating accurate captions requires learning higher-order interactions between detected objects and the relationships among them. Most existing systems account only for first-order interactions and ignore higher-order ones, and extracting discriminant higher-order semantic visual features from images densely populated with objects remains challenging for caption generation. In this paper, an efficient higher-order interaction learning framework is proposed for encoder-decoder-based image captioning. A scaled version of the Gaussian Error Linear Unit (GELU) activation function, x-GELU, is introduced that controls vanishing gradients and enhances feature learning. To leverage higher-order interactions among multiple objects, an efficient XGL Transformer (XGL-T) model is introduced that exploits both spatial and channel-wise attention by integrating four XGL attention modules in the image encoder and one in the Bilinear Long Short-Term Memory guided sentence decoder. The proposed model captures rich semantic concepts from objects, attributes, and their relationships. Extensive experiments are conducted on the publicly available MSCOCO Karpathy test split, and the best performance is observed as 81.5 BLEU@1, 67.1 BLEU@2, 51.6 BLEU@3, 39.9 BLEU@4, 134 CIDEr, 59.9 ROUGE-L, 29.8 METEOR, and 23.8 SPICE using the CIDEr-D score optimization strategy. These scores demonstrate significant improvements over state-of-the-art results. An ablation study is also carried out to support the experimental observations.
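The abstract describes x-GELU only as a scaled variant of GELU that mitigates vanishing gradients; its exact formulation is not given here. The sketch below is a minimal illustration, assuming the scaling enters as a factor alpha inside the standard GELU gate; the function name x_gelu and the default alpha value are assumptions for illustration, not the paper's definition.

```python
import math

def gelu(x: float) -> float:
    """Standard Gaussian Error Linear Unit: GELU(x) = x * Phi(x),
    where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def x_gelu(x: float, alpha: float = 1.5) -> float:
    """Hypothetical scaled GELU, for illustration only.

    The scale factor alpha sharpens the transition of the Gaussian CDF
    gate around zero. The actual x-GELU formulation and scale used in
    the paper are not reproduced here.
    """
    return 0.5 * x * (1.0 + math.erf(alpha * x / math.sqrt(2.0)))

if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={v:+.1f}  gelu={gelu(v):+.4f}  x_gelu={x_gelu(v):+.4f}")
```

With alpha set to 1 the two functions coincide exactly; larger values of alpha make the gate switch more sharply between suppressing and passing the input.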
Pages: 4219-4240
Number of pages: 21