Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Natural Science Foundation of Shanghai
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) that models the neural interaction between vision and semantics, exploring the deep fusion and mutual promotion of multimodal information to realize more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance the decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching a CIDEr score of 142% on the Karpathy test split.
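The abstract describes two mechanisms: a co-attention encoder in which visual and semantic features refine each other through mutual cross-attention, and a decoder-side module that adaptively fuses the two modalities. Below is a minimal PyTorch sketch of that general idea, not the authors' implementation; the module names, dimensions (d_model of 512, 36 region features, 20 attribute embeddings), and the simple gated fusion standing in for the paper's mixed attention module are all illustrative assumptions.

import torch
import torch.nn as nn

class MutualCrossAttention(nn.Module):
    """Sketch of mutual cross-attention: each modality attends to the other."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.vis_to_sem = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.sem_to_vis = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # Visual queries attend over semantic keys/values and vice versa,
        # so each stream is refined by the other (mutual promotion).
        v_ctx, _ = self.vis_to_sem(query=vis, key=sem, value=sem)
        s_ctx, _ = self.sem_to_vis(query=sem, key=vis, value=vis)
        return self.norm_v(vis + v_ctx), self.norm_s(sem + s_ctx)

class GatedFusion(nn.Module):
    """Illustrative stand-in for adaptive multimodal fusion in the decoder."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # Per-position sigmoid gate weighting the two modalities.
        g = torch.sigmoid(self.gate(torch.cat([vis, sem], dim=-1)))
        return g * vis + (1.0 - g) * sem

if __name__ == "__main__":
    vis = torch.randn(2, 36, 512)   # e.g. 36 detected region features (assumed)
    sem = torch.randn(2, 20, 512)   # e.g. 20 semantic attribute embeddings (assumed)
    refined_v, refined_s = MutualCrossAttention()(vis, sem)
    sem_ctx = refined_s.mean(dim=1, keepdim=True).expand_as(refined_v)
    fused = GatedFusion()(refined_v, sem_ctx)
    print(refined_v.shape, refined_s.shape, fused.shape)

Running the snippet prints the refined visual, refined semantic, and fused feature shapes; it is only meant to make the co-interaction and fusion flow concrete, and the actual VST encoder additionally incorporates global position sensitivity not modeled here.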
Pages: 4716-4724
Page count: 9