A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues

Cited: 17
Authors
Sharma, Himanshu [1 ]
Padha, Devanand [1 ]
Affiliations
[1] Central University of Jammu, Department of Computer Science & Information Technology, Jammu 181124, Jammu & Kashmir, India
Keywords
Attention-based image captioning; Encoder-decoder architecture; Image captioning; Multimodal embedding; CONVOLUTIONAL NETWORKS; AUTOMATIC IMAGE; GENERATION; TRANSFORMER; RETRIEVAL; ATTENTION; SPEECH; LANGUAGE; DATABASE; MODELS;
DOI
10.1007/s10462-023-10488-2
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a relatively recent area at the convergence of computer vision and natural language processing, and it is widely used in applications such as multi-modal search, robotics, security, remote sensing, medicine, and visual aid. Image captioning techniques have witnessed a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies using our proposed taxonomy. Furthermore, the study traces several eras of image captioning advancements, including template-based, retrieval-based, and encoder-decoder-based models, and we also explore captioning in languages other than English. Benchmark image captioning datasets and evaluation measures are also examined in depth. The limited effectiveness of real-time image captioning is a severe barrier that prevents its use in sensitive applications such as visual aid, security, and medicine. Another observation from our research is the scarcity of personalized domain datasets, which limits adoption in more advanced problems. Despite influential contributions from several researchers, further effort is required to construct substantially more robust and reliable image captioning models.
Pages: 13619-13661
Page count: 43
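
The taxonomy summarized in the abstract centers on encoder-decoder captioning models. As a purely illustrative supplement, the sketch below shows a minimal encoder-decoder captioner (a CNN encoder feeding an RNN decoder) in PyTorch; all class names, dimensions, and the dummy data are assumptions for demonstration and do not reproduce any specific method covered in the survey.

# Illustrative sketch only: a minimal encoder-decoder image captioner of the
# kind surveyed in the paper. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a fixed-length feature vector with a CNN backbone."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)                     # pretrained weights optional
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])    # drop the classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(images).flatten(1)                          # (B, 512)
        return self.fc(feats)                                        # (B, embed_dim)


class RNNDecoder(nn.Module):
    """Generate a caption token-by-token conditioned on the image embedding."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image embedding as the first "token" of the input sequence.
        tokens = self.embed(captions)                                # (B, T, embed_dim)
        inputs = torch.cat([img_feat.unsqueeze(1), tokens], dim=1)   # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)                                # (B, T+1, hidden_dim)
        return self.out(hidden)                                      # per-step vocabulary logits


if __name__ == "__main__":
    images = torch.randn(2, 3, 224, 224)                             # dummy image batch
    captions = torch.randint(0, 1000, (2, 12))                       # dummy token ids
    encoder, decoder = CNNEncoder(), RNNDecoder(vocab_size=1000)
    logits = decoder(encoder(images), captions)
    print(logits.shape)                                              # torch.Size([2, 13, 1000])

At inference time, such a model would feed its own predicted token back into the decoder step by step (greedy or beam search); attention-based variants discussed in the survey replace the single image vector with a grid of spatial features attended at every decoding step.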