A survey on automatic image caption generation

Cited by: 120
Authors
Bai, Shuang [1 ]
An, Shan [2 ]
Affiliations
[1] Beijing Jiaotong Univ, Sch Elect & Informat Engn, 3 Shang Yuan Cun, Beijing, Peoples R China
[2] Beijing Jingdong Shangke Informat Technol Co Ltd, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Sentence template; Deep neural networks; Multimodal embedding; Encoder-decoder framework; Attention mechanism; NEURAL-NETWORKS; DEEP; REPRESENTATION; SCENE;
DOI
10.1016/j.neucom.2018.05.080
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning is the task of automatically generating a caption for an image. As a recently emerged research area, it is attracting increasing attention. To caption an image, the semantic information of the image must be captured and expressed in natural language. Bridging the research communities of computer vision and natural language processing, image captioning is a challenging task, and various approaches have been proposed to solve it. In this paper, we present a survey of advances in image captioning research. Based on the technique adopted, we classify image captioning approaches into different categories; representative methods in each category are summarized, and their strengths and limitations are discussed. We first discuss the methods used in early work, which are mainly retrieval-based and template-based. We then focus on neural-network-based methods, which give state-of-the-art results. Neural-network-based methods are further divided into subcategories according to the specific framework they use, and each subcategory is discussed in detail. After that, state-of-the-art methods are compared on benchmark datasets, followed by a discussion of future research directions. (C) 2018 Elsevier B.V. All rights reserved.
Pages: 291-304
Page count: 14