A comprehensive survey on deep-learning-based visual captioning

Cited by: 1
Authors
Xin, Bowen [1]
Xu, Ning [2]
Zhai, Yingchen [2]
Zhang, Tingting [2]
Lu, Zimu [2]
Liu, Jing [2]
Nie, Weizhi [2]
Li, Xuanya [3]
Liu, An-An [2, 4]
Affiliations
[1] Heilongjiang Univ, Sch Elect Engn, Harbin 150006, Heilongjiang, Peoples R China
[2] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[3] Baidu Inc, Beijing 100085, Peoples R China
[4] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230088, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual captioning; Deep learning; Survey; LONG-TERM; IMAGE; ALGORITHMS; NETWORK; VISION; GRAPH;
DOI
10.1007/s00530-023-01175-x
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Generating a description for an image or video is termed the visual captioning task. It requires the model to capture the semantic information of visual content and translate it into syntactically and semantically correct human language. Connecting the research communities of computer vision (CV) and natural language processing (NLP), visual captioning presents the major challenge of bridging the gap between low-level visual features and high-level language information. Thanks to recent advances in deep learning, which are widely applied to visual and language modeling, visual captioning methods based on deep neural networks have demonstrated state-of-the-art performance. In this paper, we aim to present a comprehensive survey of existing deep-learning-based visual captioning methods. Based on the mechanisms and techniques adopted to narrow the semantic gap, we divide visual captioning methods into several groups. Representative categories in each group are summarized, and their strengths and limitations are discussed. Quantitative evaluations of state-of-the-art approaches on popular benchmark datasets are also presented and analyzed. Finally, we discuss future research directions.
Pages: 3781 - 3804
Page count: 24
Related papers
50 in total
  • [21] Deep-learning-based visual data analytics for smart construction management
    Pal, Aritra
    Hsieh, Shang-Hsien
    AUTOMATION IN CONSTRUCTION, 2021, 131
  • [22] Survey of deep learning and architectures for visual captioning-transitioning between media and natural languages
    Sur, Chiranjib
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 32187 - 32237
  • [23] Deep-Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey
    Huang, Liwei
    Jiang, Bitao
    Lv, Shouye
    Liu, Yanbo
    Fu, Ying
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2024, 17 : 8370 - 8396
  • [24] Toward Deep-Learning-Based Methods in Image Forgery Detection: A Survey
    Pham, Nam Thanh
    Park, Chun-Su
    IEEE ACCESS, 2023, 11 : 11224 - 11237
  • [25] Survey of Visual SLAM Based on Deep Learning
    Huang Z.
    Shao C.
    Jiqiren/Robot, 2023, 45 (06) : 756 - 768
  • [26] Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey
    Xu, Yuecong
    Cao, Haozhi
    Xie, Lihua
    Li, Xiao-Li
    Chen, Zhenghua
    Yang, Jianfei
    ACM COMPUTING SURVEYS, 2024, 56 (12)
  • [27] A comprehensive literature review on image captioning methods and metrics based on deep learning technique
    Ahmad Sami Al-Shamayleh
    Omar Adwan
    Mohammad A. Alsharaiah
    Abdelrahman H. Hussein
    Qasem M. Kharma
    Christopher Ifeanyi Eke
    Multimedia Tools and Applications, 2024, 83 : 34219 - 34268
  • [28] Video super-resolution based on deep learning: a comprehensive survey
    Liu, Hongying
    Ruan, Zhubo
    Zhao, Peng
    Dong, Chao
    Shang, Fanhua
    Liu, Yuanyuan
    Yang, Linlin
    Timofte, Radu
    ARTIFICIAL INTELLIGENCE REVIEW, 2022, 55 (08) : 5981 - 6035
  • [29] A Comprehensive Survey of Recommender Systems Based on Deep Learning
    Zhou, Hongde
    Xiong, Fei
    Chen, Hongshu
    APPLIED SCIENCES-BASEL, 2023, 13 (20)
  • [30] An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
    Michelsanti, Daniel
    Tan, Zheng-Hua
    Zhang, Shi-Xiong
    Xu, Yong
    Yu, Meng
    Yu, Dong
    Jensen, Jesper
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1368 - 1396