A comprehensive survey on image captioning: from handcrafted to deep learning-based techniques, a taxonomy and open research issues

Cited by: 17
Authors
Sharma, Himanshu [1]
Padha, Devanand [1]
Affiliations
[1] Cent Univ Jammu, Dept Comp Sci & Informat Technol, Jammu & Kashmir, Jammu 181124, India
Keywords
Attention-based image captioning; Encoder-decoder architecture; Image captioning; Multimodal embedding; CONVOLUTIONAL NETWORKS; AUTOMATIC IMAGE; GENERATION; TRANSFORMER; RETRIEVAL; ATTENTION; SPEECH; LANGUAGE; DATABASE; MODELS
DOI
10.1007/s10462-023-10488-2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a relatively recent area at the intersection of computer vision and natural language processing and is widely used in applications such as multi-modal search, robotics, security, remote sensing, medicine, and visual aids. Image captioning techniques have witnessed a paradigm shift from classical machine-learning-based approaches to contemporary deep learning-based techniques. In this survey, we present an in-depth investigation of image captioning methodologies using our proposed taxonomy. Furthermore, the study investigates several eras of image captioning advancements, including template-based, retrieval-based, and encoder-decoder-based models. We also explore captioning in languages other than English and thoroughly examine benchmark image captioning datasets and assessment measures. The effectiveness of real-time image captioning remains a severe barrier to its use in sensitive applications such as visual aid, security, and medicine. Another observation from our research is the scarcity of personalized domain datasets, which limits the adoption of image captioning for more advanced problems. Despite influential contributions from several academics, further efforts are required to construct substantially more robust and reliable image captioning models.
Pages: 13619-13661
Number of pages: 43