Feedback Attention Model for Image Captioning

Cited by: 0
Authors
Lyu F. [1 ,5 ]
Hu F. [1 ,2 ]
Zhang Y. [3 ]
Xia Z. [1 ]
Sheng V.S. [4 ,6 ]
Affiliations
[1] School of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou
[2] Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou
[3] School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an
[4] Department of Computer Science, University of Central Arkansas, Conway, AR 72035
[5] College of Intelligence and Computing, Tianjin University, Tianjin
[6] Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou
Source
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics | 2019, Vol. 31, No. 7
Keywords
Attention feedback; Attention mechanism; Image captioning
DOI
10.3724/SP.J.1089.2019.17505
Abstract
Image captioning aims to enable a machine to generate a relevant sentence for a given image, a capability that has been applied to service robots. To improve captioning performance, many researchers leverage the attention mechanism; however, this mechanism often suffers from distraction and sentence disorder. In this paper, we propose an image captioning model based on a novel feedback attention mechanism. When generating the language description for a given image, the proposed model uses attention feedback from the language generated so far: with this feedback, the attention heatmap over the original image is revised, and the quality of the generated sentence improves accordingly. We evaluate the proposed method on three benchmark datasets, i.e., Flickr8k, Flickr30k, and MSCOCO, and the experimental results show its superiority. © 2019, Beijing China Science Journal Publishing Co. Ltd. All rights reserved.
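The abstract describes the mechanism only at an architectural level: the decoder attends over regional image features at each step, and a feedback signal derived from the words generated so far revises the attention heatmap before the next word is produced. As a rough illustration only, the PyTorch sketch below shows one plausible way such a feedback term could enter a standard additive (soft) attention score; all module names, dimensions, and the choice of feedback encoding are hypothetical and are not taken from the paper.

# Hypothetical sketch of feedback attention for captioning (not the authors' code).
# Assumes PyTorch and regional CNN features of shape (batch, regions, feat_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackAttention(nn.Module):
    """Additive attention whose region scores are revised by language feedback."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)   # image features -> attention space
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)  # decoder state -> attention space
        self.fb_proj = nn.Linear(hidden_dim, attn_dim)   # feedback from generated words
        self.score = nn.Linear(attn_dim, 1)              # scalar score per region

    def forward(self, feats, hidden, feedback):
        # feats: (B, L, feat_dim); hidden: (B, hidden_dim); feedback: (B, hidden_dim)
        e = torch.tanh(self.feat_proj(feats)
                       + self.hid_proj(hidden).unsqueeze(1)
                       + self.fb_proj(feedback).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=1)  # revised heatmap over regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # attended image context vector
        return context, alpha

if __name__ == "__main__":
    attn = FeedbackAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
    feats = torch.randn(2, 49, 2048)   # e.g. a 7x7 grid of CNN features
    hidden = torch.randn(2, 512)
    feedback = torch.randn(2, 512)     # e.g. state of a recurrent pass over emitted words
    ctx, alpha = attn(feats, hidden, feedback)
    print(ctx.shape, alpha.shape)      # (2, 2048) and (2, 49)

In a complete decoder, the feedback vector might be the hidden state of a second recurrent pass over the already-emitted words; the paper's exact formulation is not reproduced here.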
Pages: 1122-1129
Page count: 7
References (36 in total)
  • [1] Vinyals O., Toshev A., Bengio S., et al., Show and tell: a neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164, (2015)
  • [2] Xu K., Ba J., Kiros R., et al., Show, attend and tell: neural image caption generation with visual attention, Proceedings of International Conference on Machine Learning, pp. 2048-2057, (2015)
  • [3] Krizhevsky A., Sutskever I., Hinton G.E., ImageNet classification with deep convolutional neural networks, Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, pp. 1097-1105, (2012)
  • [4] Simonyan K., Zisserman A., Very deep convolutional networks for large-scale image recognition, Proceedings of International Conference on Learning Representations, (2015)
  • [5] He K.M., Zhang X.Y., Ren S.Q., et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, (2016)
  • [6] Bahdanau D., Cho K., Bengio Y., Neural machine translation by jointly learning to align and translate, Proceedings of International Conference on Learning Representations, (2015)
  • [7] Sun F., Qin K., Sun W., et al., Image saliency detection based on region merging, Journal of Computer-Aided Design & Computer Graphics, 28, 10, pp. 1679-1687, (2016)
  • [8] Gao S., Zhang L., Li C., et al., Image saliency detection via graph representation with fusing low-level and high-level features, Journal of Computer-Aided Design & Computer Graphics, 28, 3, pp. 420-426, (2016)
  • [9] You Q.Z., Jin H.L., Wang Z.W., et al., Image captioning with semantic attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651-4659, (2016)
  • [10] Gu J.X., Cai J.F., Wang G., et al., Stack-captioning: coarse-to-fine learning for image captioning, Proceedings of AAAI Conference on Artificial Intelligence, pp. 6837-6844, (2018)