Enhancing visual contextual semantic information for image captioning

Cited: 0
Authors
Wang, Ronggui [1 ]
Li, Shuo [1 ]
Xue, Lixia [1 ]
Yang, Juan [1 ]
Affiliations
[1] Hefei University of Technology, School of Computer Science and Information Engineering, Hefei, China
Keywords
Image captioning; Dilated attention; Transformer; Attention
DOI
10.1007/s13042-025-02634-9
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a fundamental task in the multimodal field, where the goal is to transform images into coherent text through deep network processing. Simple grid features often perform poorly because they lack contextual information and contain excessive noise. This paper addresses these issues by enhancing fine-grained grid features with contextual semantic information within the original transformer framework. We propose a novel Dilated Attention Fusion Transformer (DAFT). First, we integrate semantic segmentation features through a cross-attention-based feature fusion module to capture object-related information comprehensively. We then propose a novel grid-based multi-scale multi-head sparse attention mechanism, which improves granularity while reducing unnecessary noise and computational cost. Additionally, we employ a weighted residual connection to fuse multi-layer information, generating richer representations. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of DAFT, improving CIDEr from 133.2% to 135.7% and significantly outperforming the baseline. The source code is available at https://github.com/lishuo19981027/DAFT.
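The two fusion ideas named in the abstract, cross-attention fusion of grid features with segmentation features and a weighted residual combination of multi-layer outputs, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions; the function names, tensor shapes, and single-head formulation are placeholders, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(grid, seg):
    """Illustrative cross-attention fusion: grid features act as queries
    over segmentation features (keys/values); the attended semantic
    context is added back to the grid as a residual.
    Shapes: grid (n_grid, d), seg (n_seg, d)."""
    d = grid.shape[-1]
    scores = grid @ seg.T / np.sqrt(d)   # (n_grid, n_seg) similarities
    attn = softmax(scores, axis=-1)      # each row sums to 1
    context = attn @ seg                 # (n_grid, d) attended semantics
    return grid + context

def weighted_residual_fuse(layer_outputs, weights):
    """Illustrative weighted residual connection: combine several layer
    outputs with normalized (softmaxed) scalar weights, which in the
    real model would be learned parameters."""
    w = softmax(np.asarray(weights, dtype=float))
    return sum(wi * h for wi, h in zip(w, layer_outputs))
```

In a real model both operations would run per attention head with learned projection matrices; the sketch keeps only the core arithmetic that gives the grid features extra contextual semantics.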
Pages: 16