GVA: guided visual attention approach for automatic image caption generation

Cited by: 0
Authors
Md. Bipul Hossen
Zhongfu Ye
Amr Abdussalam
Md. Imran Hossain
Affiliations
[1] University of Science and Technology of China,School of Information Science and Technology
[2] Pabna University of Science and Technology,Department of ICE
Source
Multimedia Systems | 2024, Vol. 30
Keywords
Image captioning; Faster R-CNN; LSTM; Up–down model; Encoder–decoder framework;
DOI
Not available
Abstract
Automated image caption generation with attention mechanisms focuses on visual features of the image, including objects, attributes, actions, and scenes, to understand and produce more detailed captions, and has attracted considerable attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models use only one attention module to assign attention weights to the visual vectors, which may not be enough to create an informative caption. To tackle this issue, we propose an innovative and well-designed Guided Visual Attention (GVA) approach that incorporates an additional attention mechanism to re-adjust the attention weights on the visual feature vectors and feeds the resulting context vector to the language LSTM. Using the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances the quality of the generated captions. Recently, deep neural networks have allowed encoder-decoder architectures to make use of visual attention mechanisms; here, Faster R-CNN extracts region features in the encoder and a visual attention-based LSTM is applied in the decoder. Extensive experiments were conducted on the MS-COCO and Flickr30k benchmark datasets. Compared with state-of-the-art methods, our approach achieved an average improvement of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, as well as 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30k dataset, under cross-entropy optimization. These results demonstrate the clear superiority of our proposed approach over existing methods on standard evaluation metrics. The implementation code is available at https://github.com/mdbipu/GVA.
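To make the two-level attention flow described above concrete, the following is a minimal PyTorch sketch of one decoding step of an up-down style decoder with a guided second attention pass. The class names (AdditiveAttention, GVADecoderStep), the dimensions, and the exact way the first-level context vector conditions the guided attention are illustrative assumptions rather than the paper's reference implementation: an attention LSTM attends over Faster R-CNN region features, its first context vector guides a second attention that re-weights the same features, and the re-weighted context is fed to the language LSTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Additive (Bahdanau-style) attention over region features."""

    def __init__(self, feat_dim, query_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.query_proj = nn.Linear(query_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, query):
        # feats: (B, K, feat_dim) region features; query: (B, query_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.query_proj(query).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)              # (B, K) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # (B, feat_dim)
        return context, alpha


class GVADecoderStep(nn.Module):
    """One decoding step of an up-down decoder with a guided second attention pass (illustrative)."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512,
                 attn_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Attention LSTM sees the previous language-LSTM state,
        # the mean-pooled image feature, and the previous word embedding.
        self.attn_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.first_attn = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        # Guided attention: its query combines the attention-LSTM state
        # with the first-level context vector (the "guidance").
        self.guided_attn = AdditiveAttention(feat_dim, hidden_dim + feat_dim, attn_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, prev_word, state):
        # feats: (B, K, feat_dim) Faster R-CNN region features
        # prev_word: (B,) previous word indices; state: ((h_a, c_a), (h_l, c_l))
        (h_attn, c_attn), (h_lang, c_lang) = state
        v_mean = feats.mean(dim=1)
        x = torch.cat([h_lang, v_mean, self.embed(prev_word)], dim=1)
        h_attn, c_attn = self.attn_lstm(x, (h_attn, c_attn))

        # First-level attention: its context vector serves as guidance.
        ctx1, _ = self.first_attn(feats, h_attn)
        # Second-level (guided) attention re-weights the same region features,
        # conditioned on the attention-LSTM state and the first context.
        ctx2, _ = self.guided_attn(feats, torch.cat([h_attn, ctx1], dim=1))

        # The language LSTM consumes the re-weighted context vector.
        h_lang, c_lang = self.lang_lstm(torch.cat([ctx2, h_attn], dim=1), (h_lang, c_lang))
        logits = self.out(h_lang)  # word distribution for the current time step
        return logits, ((h_attn, c_attn), (h_lang, c_lang))
```

As a usage check under these assumed shapes, feats = torch.randn(2, 36, 2048) (36 regions per image), prev_word = torch.zeros(2, dtype=torch.long), and zero-initialized hidden/cell states of shape (2, 512) yield logits of shape (2, 10000) plus the updated states; the step is applied recurrently over the caption length during training or beam search.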