Enhancing visual contextual semantic information for image captioning

Times Cited: 0
Authors
Wang, Ronggui [1]
Li, Shuo [1]
Xue, Lixia [1]
Yang, Juan [1]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
Keywords
Image captioning; Dilated attention; Transformer; Attention
DOI
10.1007/s13042-025-02634-9
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a fundamental task in the multimodal field, where the goal is to transform images into coherent text through deep network processing. Simple grid features often perform poorly because they lack contextual information and contain excessive noise. This paper addresses these issues by enhancing fine-grained grid features with contextual semantic information within the original transformer framework. We propose a novel Dilated Attention Fusion Transformer (DAFT). First, we integrate semantic segmentation features through a cross-attention-based Feature Fusion Module to capture object-related information comprehensively. We then propose a novel grid-based multi-scale multi-head sparse attention mechanism, which improves granularity while reducing unnecessary noise and computational cost. Additionally, we employ a weighted residual connection to fuse multi-layer information, generating richer representations. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of DAFT, improving CIDEr from 133.2 to 135.7 and achieving significantly better performance than the baseline. The source code is available at https://github.com/lishuo19981027/DAFT.
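To make the "grid-based multi-scale multi-head sparse attention" idea in the abstract concrete, below is a minimal sketch in plain PyTorch. It is not the authors' released DAFT code: the class name DilatedGridAttention, the grid_size and dilations parameters, and the per-head dilation masking are hypothetical choices that only illustrate how different heads can attend to the feature grid at different dilation rates, sparsifying attention and reducing noise and cost.

# Minimal sketch (assumption: not the authors' code) of grid-based dilated
# multi-head self-attention over image grid features.
import torch
import torch.nn as nn

class DilatedGridAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, grid_size=7, dilations=(1, 2, 3)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.grid_size = grid_size
        # Assign one dilation rate per head, cycling over the given scales.
        self.dilations = [dilations[h % len(dilations)] for h in range(num_heads)]
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def _dilated_mask(self, dilation):
        # Boolean mask over the flattened grid: position (i, j) may attend to
        # (i', j') only if both row and column offsets are multiples of `dilation`.
        g = self.grid_size
        idx = torch.arange(g)
        keep = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() % dilation == 0   # (g, g)
        mask = keep.unsqueeze(1).unsqueeze(3) & keep.unsqueeze(0).unsqueeze(2)
        return mask.reshape(g * g, g * g)                                    # (N, N)

    def forward(self, x):
        # x: (batch, N, dim) grid features with N = grid_size ** 2
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5            # (B, H, N, N)
        # A different dilation mask per head yields multi-scale sparse attention.
        masks = torch.stack([self._dilated_mask(d) for d in self.dilations]).to(x.device)
        scores = scores.masked_fill(~masks.unsqueeze(0), float("-inf"))
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

# Example: a batch of two 7x7 grids of 512-d features.
feats = torch.randn(2, 49, 512)
print(DilatedGridAttention()(feats).shape)   # torch.Size([2, 49, 512])

With dilation 1 a head recovers dense attention; larger dilations skip neighbouring grid cells, so the head set as a whole covers several receptive scales while each head computes over a sparser neighbourhood.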
Pages: 16
Related Papers
62 records in total
[11] Gao Y-M, 2022, P 2022 INT C MULT RE.
[12] Guo L, Liu J, Zhu X, Yao P, Lu S, Lu H. Normalized and Geometry-Aware Self-Attention Network for Image Captioning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020: 10324-10333.
[13] Gupta A, 2012, AAAI Conference on Artificial Intelligence.
[14] Herdade S, 2019, Neural Information Processing Systems.
[15] Huang L, Wang W, Chen J, Wei X-Y. Attention on Attention for Image Captioning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), 2019: 4633-4642.
[16] Ji J, Huang X, Sun X, Zhou Y, Luo G, Cao L, Liu J, Shao L, Ji R. Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning. IEEE Transactions on Multimedia, 2023, 25: 3962-3974.
[17] Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X. In Defense of Grid Features for Visual Question Answering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020: 10264-10273.
[18] Jiang W, Zhou W, Hu H. Double-Stream Position Learning Transformer Network for Image Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(11): 7706-7718.
[19] Jiang W, Ma L, Jiang Y-G, Liu W, Zhang T. Recurrent Fusion Network for Image Captioning. Computer Vision - ECCV 2018, Part II, 2018, 11206: 510-526.
[20] Jiang Z, Wang X, Zhai Z, Cheng B. LG-MLFormer: local and global MLP for image captioning. International Journal of Multimedia Information Retrieval, 2023, 12(1).