Improving Image Captioning through Visual and Semantic Mutual Promotion

Cited by: 0
Authors
Zhang, Jing [1 ]
Xie, Yingshuai [1 ]
Liu, Xiaoqiang [1 ]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Natural Science Foundation of Shanghai;
Keywords
Image Captioning; Transformer; Co-attention; Multimodal Fusion;
DOI
10.1145/3581783.3612480
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Current image captioning methods commonly use semantic attributes extracted by an object detector to guide visual representation, leaving the mutual guidance and enhancement between vision and semantics under-explored. Neurological studies have revealed that the visual cortex of the brain plays a crucial role in recognizing visual objects, while the prefrontal cortex is involved in the integration of contextual semantics. Inspired by these studies, we propose a novel Visual-Semantic Transformer (VST) to model the neural interaction between vision and semantics, which explores the mechanism of deep fusion and mutual promotion of multimodal information, realizing more accurate image captioning. To better exploit the complementary strengths of visual objects and semantic contexts, we propose a global position-sensitive co-attention encoder that realizes globally associative, position-aware visual and semantic co-interaction through a mutual cross-attention mechanism. In addition, a multimodal mixed attention module is proposed in the decoder, which achieves adaptive multimodal feature fusion to enhance decoding capability. Experimental evidence shows that our VST significantly surpasses state-of-the-art approaches on the MSCOCO dataset, reaching an excellent CIDEr score of 142% on the Karpathy test split.
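The co-attention encoder described in the abstract rests on mutual cross-attention between a visual stream and a semantic stream. The PyTorch sketch below is only a minimal illustration of that idea, not the authors' implementation: the module name, dimensions, and the symmetric residual layout are assumptions, and the global position-sensitive components and the decoder's mixed attention module are omitted.

```python
import torch
import torch.nn as nn


class MutualCrossAttention(nn.Module):
    """Sketch of a vision-semantics co-interaction block: visual tokens
    attend to semantic attribute tokens and vice versa, so each modality
    is refined by the other before decoding. Names and sizes are
    illustrative assumptions, not the paper's architecture."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Visual stream queries the semantic stream, and vice versa.
        self.vis_from_sem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_from_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_s = nn.LayerNorm(d_model)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor):
        # vis: (B, N_v, d) visual region/grid features;
        # sem: (B, N_s, d) semantic attribute embeddings.
        v_upd, _ = self.vis_from_sem(query=vis, key=sem, value=sem)
        s_upd, _ = self.sem_from_vis(query=sem, key=vis, value=vis)
        # Residual connection + layer norm, applied symmetrically to both streams.
        return self.norm_v(vis + v_upd), self.norm_s(sem + s_upd)


if __name__ == "__main__":
    vis = torch.randn(2, 49, 512)   # e.g. a 7x7 grid of visual features
    sem = torch.randn(2, 10, 512)   # e.g. 10 detected attribute embeddings
    vis2, sem2 = MutualCrossAttention()(vis, sem)
    print(vis2.shape, sem2.shape)   # torch.Size([2, 49, 512]) torch.Size([2, 10, 512])
```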
Pages: 4716 - 4724
Page count: 9