Towards local visual modeling for image captioning

Cited by: 49
Authors
Ma, Yiwei [1 ]
Ji, Jiayi [1 ]
Sun, Xiaoshuai [1 ,2 ,4 ]
Zhou, Yiyi [1 ]
Ji, Rongrong [1 ,2 ,3 ]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Room B705,Haiyu Adm Bldg,XMU Haiyun Campus, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Attention mechanism; Local visual modeling;
DOI
10.1016/j.patcog.2023.109420
CLC classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion, aggregating the information of different encoder layers for cross-layer semantic complementarity. With these two novel designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a number of state-of-the-art captioning models on offline and online testing, achieving 134.8 CIDEr and 136.3 CIDEr, respectively. The generalization of LSTNet is also verified on the Flickr8k and Flickr30k datasets. The source code is available on GitHub: https://www.github.com/xmu-xiaoma666/LSTNet. (c) 2023 Elsevier Ltd. All rights reserved.
Pages: 12
Related papers
50 items
  • [1] MODELING LOCAL AND GLOBAL CONTEXTS FOR IMAGE CAPTIONING
    Yao, Peng
    Li, Jiangyun
    Guo, Longteng
    Liu, Jing
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [2] Local-global visual interaction attention for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    DIGITAL SIGNAL PROCESSING, 2022, 130
  • [3] Modeling visual and word-conditional semantic attention for image captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Su, Fei
    Wang, Leiquan
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2018, 67 : 100 - 107
  • [4] Geometry-sensitive semantic modeling in visual and visual-language domains for image captioning
    Zhu, Wencai
    Jiang, Zetao
    He, Yuting
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 147
  • [5] Visual Relationship Attention for Image Captioning
    Zhang, Zongjian
    Wu, Qiang
    Wang, Yang
    Chen, Fang
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [6] Bengali Image Captioning with Visual Attention
    Ami, Amit Saha
    Humaira, Mayeesha
    Jim, Md Abidur Rahman Khan
    Paul, Shimul
    Shah, Faisal Muhammad
    2020 23RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2020), 2020,
  • [7] Visual Cluster Grounding for Image Captioning
    Jiang, Wenhui
    Zhu, Minwei
    Fang, Yuming
    Shi, Guangming
    Zhao, Xiaowei
    Liu, Yang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 3920 - 3934
  • [8] A visual persistence model for image captioning
    Wang, Yiyu
    Xu, Jungang
    Sun, Yingfei
    NEUROCOMPUTING, 2022, 468 : 48 - 59
  • [9] Visual enhanced gLSTM for image captioning
    Zhang, Jing
    Li, Kangkang
    Wang, Zhenkun
    Zhao, Xianwen
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [10] Exploring Visual Relationship for Image Captioning
    Yao, Ting
    Pan, Yingwei
    Li, Yehao
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218 : 711 - 727