A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues

Cited by: 17
Authors
Sharma, Himanshu [1 ]
Padha, Devanand [1 ]
Affiliations
[1] Cent Univ Jammu, Dept Comp Sci & Informat Technol, Jammu & Kashmir, Jammu 181124, India
Keywords
Attention-based image captioning; Encoder-decoder architecture; Image captioning; Multimodal embedding; CONVOLUTIONAL NETWORKS; AUTOMATIC IMAGE; GENERATION; TRANSFORMER; RETRIEVAL; ATTENTION; SPEECH; LANGUAGE; DATABASE; MODELS;
DOI
10.1007/s10462-023-10488-2
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning is a relatively recent area at the convergence of computer vision and natural language processing, with applications ranging from multi-modal search, robotics, security, and remote sensing to medical imaging and visual aids. Image captioning techniques have witnessed a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies using our proposed taxonomy. Furthermore, the study traces several eras of image captioning advancements, including template-based, retrieval-based, and encoder-decoder-based models. We also explore captioning in languages other than English. Benchmark image captioning datasets and evaluation measures are reviewed in detail. The limited effectiveness of real-time image captioning remains a severe barrier to its use in sensitive applications such as visual aids, security, and medicine. Another observation from our research is the scarcity of personalized, domain-specific datasets, which limits adoption in more advanced applications. Despite influential contributions from several researchers, further efforts are required to construct substantially more robust and reliable image captioning models.
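As a minimal illustration of the encoder-decoder paradigm discussed in the survey, the sketch below pairs a CNN image encoder with an LSTM caption decoder in PyTorch. The class names, layer sizes, and vocabulary size (EncoderCNN, DecoderRNN, embed_size=256, and so on) are illustrative assumptions made for this note, not the survey's or any cited paper's reference implementation.

import torch
import torch.nn as nn
from torchvision import models


class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a CNN backbone."""

    def __init__(self, embed_size):
        super().__init__()
        backbone = models.resnet18(weights=None)  # pretrained weights optional
        # Drop the classification head; keep the pooled convolutional features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)   # (batch, 512)
        return self.fc(feats)                      # (batch, embed_size)


class DecoderRNN(nn.Module):
    """Decode image features into a word sequence with an LSTM language model."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first "token" of the input sequence.
        embeddings = self.embed(captions)                           # (batch, T, embed)
        inputs = torch.cat([features.unsqueeze(1), embeddings], 1)  # (batch, T+1, embed)
        hidden, _ = self.lstm(inputs)                               # (batch, T+1, hidden)
        return self.fc(hidden)                                      # per-step word logits


if __name__ == "__main__":
    # Toy forward pass with random tensors, purely to show the data flow.
    encoder, decoder = EncoderCNN(256), DecoderRNN(256, 512, vocab_size=1000)
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))
    logits = decoder(encoder(images), captions)
    print(logits.shape)  # torch.Size([2, 13, 1000])

Attention-based variants covered in the survey differ mainly in the decoder: instead of consuming a single pooled image vector, the decoder attends over the encoder's spatial feature map at every generation step.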
Pages: 13619-13661
Page count: 43