Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Natural Science Foundation of Shanghai;
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, which explores the mechanism of deep fusion and mutual promotion of multimodal information, realizing more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching an excellent CIDEr score of 142% on the Karpathy test split.
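The co-attention encoder described in the abstract rests on mutual cross-attention between a visual stream and a semantic stream. The PyTorch sketch below is only a minimal illustration of that idea, not the authors' implementation: the module name, dimensions, and the symmetric residual layout are assumptions, and the global position-sensitive components and the decoder's mixed attention module are omitted.

```python
import torch
import torch.nn as nn


class MutualCrossAttention(nn.Module):
    """Sketch of a vision-semantics co-interaction block: visual tokens
    attend to semantic attribute tokens and vice versa, so each modality
    is refined by the other before decoding. Names and sizes are
    illustrative assumptions, not the paper's architecture."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Visual stream queries the semantic stream, and vice versa.
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # vis: (B, N_v, d) visual region/grid features;
        # sem: (B, N_s, d) semantic attribute embeddings.
        v_upd, _ = self.vis_from_sem(query=vis, key=sem, value=sem)
        s_upd, _ = self.sem_from_vis(query=sem, key=vis, value=vis)
        # Residual connection + layer norm, applied symmetrically to both streams.
        return self.norm_v(vis + v_upd), self.norm_s(sem + s_upd)


if __name__ == "__main__":
    vis = torch.randn(2, 49, 512)   # e.g. a 7x7 grid of visual features
    sem = torch.randn(2, 10, 512)   # e.g. 10 detected attribute embeddings
    vis2, sem2 = MutualCrossAttention()(vis, sem)
    print(vis2.shape, sem2.shape)   # torch.Size([2, 49, 512]) torch.Size([2, 10, 512])
```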
Pages: 4716 - 4724
Page count: 9