A Hierarchical Multimodal Attention-based Neural Network for Image Captioning

Cited by: 16
Authors
Cheng, Yong [1]
Huang, Fei [1]
Zhou, Lian [1]
Jin, Cheng [1]
Zhang, Yuejie [1]
Zhang, Tao [2]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Univ Finance & Econ, Sch Informat Management & Engn, Shanghai, Peoples R China
Keywords
Image Captioning; Multimodal Attention; Hierarchical Recurrent Neural Network; Long-Short Term Memory Model
DOI
10.1145/3077136.3080671
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper develops a novel hierarchical multimodal attention-based model to generate more accurate and descriptive captions for images. The model is an end-to-end neural network that contains three related sub-networks: a deep convolutional neural network that encodes image content, a recurrent neural network that identifies the objects in an image sequentially, and a multimodal attention-based recurrent neural network that generates the caption. The main contribution is that the hierarchical structure and the multimodal attention mechanism are applied together, so that each caption word is generated with multimodal attention over both the intermediate semantic objects and the global visual content. Experiments on two benchmark datasets obtained very positive results.
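To make the attention step described in the abstract more concrete, the following is a minimal sketch of soft attention over a set of object representations plus a global visual feature. It is not the authors' implementation: the function names (`multimodal_attention`, `softmax`, `dot`), the dot-product scoring function, and the toy two-dimensional vectors are all assumptions chosen for illustration, since the abstract does not specify how attention scores are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multimodal_attention(decoder_state, object_states, global_visual):
    """Attend jointly over object states and the global visual feature.

    Assumption: scores are plain dot products with the decoder state.
    Returns the attention weights and the weighted-sum context vector.
    """
    # Candidates: one vector per intermediate object, plus the global feature.
    candidates = object_states + [global_visual]
    weights = softmax([dot(decoder_state, c) for c in candidates])
    dim = len(decoder_state)
    context = [
        sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)
    ]
    return weights, context


# Toy inputs (hypothetical values, not taken from the paper).
decoder_state = [1.0, 0.0]
object_states = [[2.0, 0.0], [0.0, 2.0]]
global_visual = [0.0, 0.0]
weights, context = multimodal_attention(decoder_state, object_states, global_visual)
```

In this toy setup the first object aligns with the decoder state and therefore receives the largest weight, while the second object and the global feature tie. In a caption decoder of the kind the abstract describes, the resulting context vector would be fed, together with the previous word, into the recurrent unit that predicts the next caption word.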
Pages: 889-892
Page count: 4