Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Natural Science Foundation of Shanghai
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) that models the neural interaction between vision and semantics, exploring the deep fusion and mutual promotion of multimodal information to realize more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance the decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching a CIDEr score of 142% on the Karpathy test split.
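The abstract describes two mechanisms: a co-attention encoder in which visual and semantic features refine each other through mutual cross-attention, and a decoder-side module that adaptively fuses the two modalities. Below is a minimal PyTorch sketch of that general idea, not the authors' implementation; the module names, dimensions (d_model of 512, 36 region features, 20 attribute embeddings), and the simple gated fusion standing in for the paper's mixed attention module are all illustrative assumptions.

import torch
import torch.nn as nn

class MutualCrossAttention(nn.Module):
    """Sketch of mutual cross-attention: each modality attends to the other."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.vis_to_sem = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.sem_to_vis = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # Visual queries attend over semantic keys/values and vice versa,
        # so each stream is refined by the other (mutual promotion).
        v_ctx, _ = self.vis_to_sem(query=vis, key=sem, value=sem)
        s_ctx, _ = self.sem_to_vis(query=sem, key=vis, value=vis)
        return self.norm_v(vis + v_ctx), self.norm_s(sem + s_ctx)

class GatedFusion(nn.Module):
    """Illustrative stand-in for adaptive multimodal fusion in the decoder."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # Per-position sigmoid gate weighting the two modalities.
        g = torch.sigmoid(self.gate(torch.cat([vis, sem], dim=-1)))
        return g * vis + (1.0 - g) * sem

if __name__ == "__main__":
    vis = torch.randn(2, 36, 512)   # e.g. 36 detected region features (assumed)
    sem = torch.randn(2, 20, 512)   # e.g. 20 semantic attribute embeddings (assumed)
    refined_v, refined_s = MutualCrossAttention()(vis, sem)
    sem_ctx = refined_s.mean(dim=1, keepdim=True).expand_as(refined_v)
    fused = GatedFusion()(refined_v, sem_ctx)
    print(refined_v.shape, refined_s.shape, fused.shape)

Running the snippet prints the refined visual, refined semantic, and fused feature shapes; it is only meant to make the co-interaction and fusion flow concrete, and the actual VST encoder additionally incorporates global position sensitivity not modeled here.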
Pages: 4716-4724
Page count: 9