Improving Image Captioning through Visual and Semantic Mutual Promotion

Authors
Zhang, Jing [1]
Xie, Yingshuai [1]
Liu, Xiaoqiang [1]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Natural Science Foundation of Shanghai
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion
DOI
10.1145/3581783.3612480
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, which explores the mechanism of deep fusion and mutual promotion of multimodal information to achieve more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance the decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset and reaches a CIDEr score of 142% on the Karpathy test split.
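The abstract does not give implementation details, so the following is only a minimal sketch of what a generic mutual cross-attention step between two modalities can look like, not the authors' VST encoder. It assumes plain scaled dot-product attention with residual connections and omits the position-sensitive component entirely; the function names (`cross_attention`, `mutual_cross_attention`) and the toy feature values are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, context):
    """Each query row attends over all context rows (scaled dot-product)."""
    scale = math.sqrt(len(queries[0]))
    weights = [softmax([dot(q, c) / scale for c in context]) for q in queries]
    dim = len(context[0])
    attended = [
        [sum(w * c[j] for w, c in zip(row, context)) for j in range(dim)]
        for row in weights
    ]
    return attended, weights

def mutual_cross_attention(visual, semantic):
    """Hypothetical mutual step: each modality queries the other one.

    Assumption: the attended features are added back as a residual;
    no learned projections or positional terms are modeled here.
    """
    v_att, v_w = cross_attention(visual, semantic)   # vision attends to semantics
    s_att, s_w = cross_attention(semantic, visual)   # semantics attend to vision
    new_visual = [[a + b for a, b in zip(r, t)] for r, t in zip(visual, v_att)]
    new_semantic = [[a + b for a, b in zip(r, t)] for r, t in zip(semantic, s_att)]
    return new_visual, new_semantic, v_w, s_w

# Toy inputs: 3 visual region features and 2 semantic attribute features, dim 4.
visual = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.3, 0.1], [0.4, 0.4, 0.0, 1.0]]
semantic = [[0.9, 0.1, 0.0, 0.3], [0.2, 0.8, 0.5, 0.0]]
new_visual, new_semantic, v_w, s_w = mutual_cross_attention(visual, semantic)
```

In this sketch both directions are computed from the same original inputs, so neither modality is treated as the sole guide of the other; a real encoder would typically add learned query, key, and value projections and multiple heads.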
Pages: 4716-4724
Number of pages: 9