A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques

Cited by: 0
Authors
Thobhani, Alaa [1]
Zou, Beiji [1]
Kui, Xiaoyan [1]
Abdussalam, Amr [2]
Asim, Muhammad [3]
Shah, Sajid [3]
Elaffendi, Mohammed [3]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China
[2] Univ Sci & Technol China, Elect Engn & Informat Sci Dept, Hefei 230026, Peoples R China
[3] Prince Sultan Univ, Coll Comp & Informat Sci, EIAS Data Sci Lab, Riyadh 11586, Saudi Arabia
Source
CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES | 2025 / Vol. 142 / No. 03
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; semantic attention; multi-caption; natural language processing; visual attention methods; AUTOMATIC IMAGE; GENERATION; ATTENTION; NETWORKS; SPEECH; MODELS;
DOI
10.32604/cmes.2025.059192
Chinese Library Classification
T [Industrial Technology];
Discipline classification code
08;
Abstract
Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful, syntactically accurate sentences that describe the visual content depicted in photographs. Many real-world applications rely on image captioning, such as helping people with visual impairments perceive their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are first used to comprehend the visual content of an image, followed by natural language processing methods to generate the text. Numerous approaches and models have been developed to deal with this multifaceted problem, and several have proven to be state-of-the-art solutions in this field. This work offers a distinct perspective emphasizing the most critical strategies and techniques for enhancing image caption generation. Rather than reviewing all previous image captioning work, we analyze techniques that yield significant performance improvements, including visual attention methods, the types of semantic information exploited in captions, and multi-caption generation techniques. Further, advancements such as neural architecture search, few-shot learning, multi-phase learning, and cross-modal embedding within image caption networks are examined for their effects. The comprehensive quantitative analysis conducted in this study identifies cutting-edge methodologies and sheds light on their impact on image captioning technology.
Pages: 2247-2280
Number of pages: 34
Related papers
180 in total
  • [1] NumCap: A Number-controlled Multi-caption Image Captioning Network
    Abdussalam, Amr
    Ye, Zhongfu
    Hawbani, Ammar
    Al-Qatf, Majjed
    Khan, Rashid
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (04)
  • [2] Ahmad S, 2024, Deep cognitive modelling in remote sensing image processing, P55
  • [3] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (02) : 3447 - 3459
  • [4] NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, Xingfu
    Abdusallam, Amr
    Zhao, Liang
    Alsamhi, Saeed Hammod
    Curry, Edward
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2024, 131
  • [5] Image Captioning With Novel Topics Guidance and Retrieval-Based Topics Re-Weighting
    Al-Qatf, Majjed
    Wang, Xingfu
    Hawbani, Ammar
    Abdussalam, Amr
    Alsamhi, Saeed Hammod
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 5984 - 5999
  • [6] Survey on Deep Neural Networks in Speech and Vision Systems
    Alam, M.
    Samad, M. D.
    Vidyaratne, L.
    Glandon, A.
    Iftekharuddin, K. M.
    [J]. NEUROCOMPUTING, 2020, 417 : 302 - 321
  • [7] FROM PIXELS TO PREDICTIONS: ROLE OF BOOSTED DEEP LEARNING-ENABLED OBJECT DETECTION FOR AUTONOMOUS VEHICLES ON LARGE SCALE CONSUMER ELECTRONICS ENVIRONMENT
    Alkhonaini, Mimouna Abdullah
    Mengash, Hanan Abdullah
    Nemri, Nadhem
    Ebad, Shouki A.
    Alotaibi, Faiz Abdullah
    Aljabri, Jawhara
    Alzahrani, Yazeed
    Alnfiai, Mrim M.
    [J]. FRACTALS-COMPLEX GEOMETRY PATTERNS AND SCALING IN NATURE AND SOCIETY, 2024, 32 (09N10)
  • [8] Amirian S, 2019, P INT C IM PROC COMP, P10
  • [9] Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap
    Amirian, Soheyla
    Rasheed, Khaled
    Taha, Thiab R.
    Arabnia, Hamid R.
    [J]. IEEE ACCESS, 2020, 8 (08) : 218386 - 218400
  • [10] Image Captioning with Generative Adversarial Network
    Amirian, Soheyla
    Rasheed, Khaled
    Taha, Thiab R.
    Arabnia, Hamid R.
    [J]. 2019 6TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE (CSCI 2019), 2019 : 272 - 275