Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0

Authors:

Zhang, Jing [1]

Xie, Yingshuai [1]

Liu, Xiaoqiang [1]

Affiliations:

[1] East China Univ Sci & Technol, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

Natural Science Foundation of Shanghai;

Keywords:

Image Captioning; Transformer; Co-attention; Multimodal Fusion;

DOI:

10.1145/3581783.3612480

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics. VST explores the mechanism of deep fusion and mutual promotion of multimodal information to achieve more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance the decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset and reaches a CIDEr score of 142% on the Karpathy test split.
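
The mutual cross-attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' VST implementation: it assumes plain scaled dot-product attention with no learned projections, no multi-head split, and no position encoding, and the names `cross_attention`, `mutual_cross_attention`, `visual`, and `semantic` are hypothetical. It only shows the two-way pattern in which visual features attend to semantic features and semantic features attend to visual features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query vector attends over every row of keys_values, which serves
    as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return out


def mutual_cross_attention(visual, semantic):
    """Two-way cross-attention with residual connections (illustrative only)."""
    v_att = cross_attention(visual, semantic)   # vision queries semantics
    s_att = cross_attention(semantic, visual)   # semantics query vision
    visual_out = [[x + y for x, y in zip(v, a)] for v, a in zip(visual, v_att)]
    semantic_out = [[x + y for x, y in zip(s, a)] for s, a in zip(semantic, s_att)]
    return visual_out, semantic_out


# Toy inputs: 5 hypothetical region features and 3 hypothetical attribute
# embeddings, all of dimension 8.
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
semantic = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
v_out, s_out = mutual_cross_attention(visual, semantic)
print(len(v_out), len(v_out[0]), len(s_out), len(s_out[0]))
```

Each modality keeps its own sequence length and feature dimension, so the two updated streams could be stacked into further layers. A real encoder would add learned query, key, and value projections and the position-sensitive terms the abstract describes.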

Citation

Pages: 4716-4724

Page count: 9

50 items in total

[21] A novel image captioning model with visual-semantic similarities and visual representations re-weighting
Thobhani, Alaa
Zou, Beiji
Kui, Xiaoyan
Al-Shargabi, Asma A.
Derea, Zaid
Abdussalam, Amr
Asham, Mohammed A.
JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2024, 36 (07)
[22] Improving Intra- and Inter-Modality Visual Relation for Image Captioning
Wang, Yong
Zhang, WenKai
Liu, Qing
Zhang, Zhengyuan
Gao, Xin
Sun, Xian
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4190 - 4198
[23] Improving Image Captioning with Image Concepts of Words
Wang, Yiyu
Xiang, Xunzhi
Jing, Kun
Xu, Jungang
Sun, Yingfei
KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, KSEM 2024, 2024, 14885 : 358 - 370
[24] Semantic association enhancement transformer with relative position for image captioning
Xin Jia
Yunbo Wang
Yuxin Peng
Shengyong Chen
Multimedia Tools and Applications, 2022, 81 : 21349 - 21367
[25] Semantic association enhancement transformer with relative position for image captioning
Jia, Xin
Wang, Yunbo
Peng, Yuxin
Chen, Shengyong
MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (15) : 21349 - 21367
[26] Image captioning: Semantic selection unit with stacked residual attention
Song, Lifei
Li, Fei
Wang, Ying
Liu, Yu
Wang, Yuanhua
Xiang, Shiming
IMAGE AND VISION COMPUTING, 2024, 144
[27] Graph-based image captioning with semantic and spatial features
Parseh, Mohammad Javad
Ghadiri, Saeed
SIGNAL PROCESSING-IMAGE COMMUNICATION, 2025, 133
[28] A visual question answering model based on image captioning
Zhou, Kun
Liu, Qiongjie
Zhao, Dexin
MULTIMEDIA SYSTEMS, 2024, 30 (06)
[29] Image captioning via semantic element embedding
Zhang, Xiaodan
He, Shengfeng
Song, Xinhang
Lau, Rynson W. H.
Jiao, Jianbin
Ye, Qixiang
NEUROCOMPUTING, 2020, 395 : 212 - 221
[30] Integrating Scene Semantic Knowledge into Image Captioning
Wei, Haiyang
Li, Zhixin
Huang, Feicheng
Zhang, Canlong
Ma, Huifang
Shi, Zhongzhi
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (02)
