A comprehensive survey on deep-learning-based visual captioning

Cited by: 1
Authors
Xin, Bowen [1]
Xu, Ning [2]
Zhai, Yingchen [2]
Zhang, Tingting [2]
Lu, Zimu [2]
Liu, Jing [2]
Nie, Weizhi [2]
Li, Xuanya [3]
Liu, An-An [2, 4]
Affiliations
[1] Heilongjiang Univ, Sch Elect Engn, Harbin 150006, Heilongjiang, Peoples R China
[2] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[3] Baidu Inc, Beijing 100085, Peoples R China
[4] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual captioning; Deep learning; Survey; LONG-TERM; IMAGE; ALGORITHMS; NETWORK; VISION; GRAPH;
DOI
10.1007/s00530-023-01175-x
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Generating a description for an image or video is termed the visual captioning task. It requires the model to capture the semantic information of visual content and translate it into syntactically and semantically correct human language. Connecting the research communities of computer vision (CV) and natural language processing (NLP), visual captioning presents the major challenge of bridging the gap between low-level visual features and high-level language information. Thanks to recent advances in deep learning, which is widely applied in visual and language modeling, visual captioning methods based on deep neural networks have demonstrated state-of-the-art performance. In this paper, we present a comprehensive survey of existing deep-learning-based visual captioning methods. We divide visual captioning methods into groups according to the mechanisms and techniques they adopt to narrow the semantic gap. Representative categories in each group are summarized, and their strengths and limitations are discussed. Quantitative evaluations of state-of-the-art approaches on popular benchmark datasets are also presented and analyzed. Finally, we discuss future research directions.
Pages: 3781-3804
Page count: 24