A Survey on Enhancing Image Captioning with Advanced Strategies and Techniques

Cited by: 0

Authors:

Thobhani, Alaa [1]

Zou, Beiji [1]

Kui, Xiaoyan [1]

Abdussalam, Amr [2]

Asim, Muhammad [3]

Shah, Sajid [3]

Elaffendi, Mohammed [3]

Affiliations:

[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China

[2] Univ Sci & Technol China, Elect Engn & Informat Sci Dept, Hefei 230026, Peoples R China

[3] Prince Sultan Univ, Coll Comp & Informat Sci, EIAS Data Sci Lab, Riyadh 11586, Saudi Arabia

Source:

CMES-COMPUTER MODELING IN ENGINEERING & SCIENCES | 2025, Vol. 142, No. 3

Funding:

National Natural Science Foundation of China

Keywords:

Image captioning; semantic attention; multi-caption; natural language processing; visual attention methods; AUTOMATIC IMAGE; GENERATION; ATTENTION; NETWORKS; SPEECH; MODELS;

DOI:

10.32604/cmes.2025.059192

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08 ;

Abstract:

Image captioning has seen significant research efforts over the last decade. The goal is to generate meaningful, syntactically accurate sentences that describe the visual content depicted in photographs. Many real-world applications rely on image captioning, such as helping people with visual impairments understand their surroundings. To formulate a coherent and relevant textual description, computer vision techniques are used to comprehend the visual content of an image, followed by natural language processing methods to generate the text. Numerous approaches and models have been developed to deal with this multifaceted problem, and several have proven to be state-of-the-art solutions in this field. This work offers a distinct perspective emphasizing the most critical strategies and techniques for enhancing image caption generation. Rather than reviewing all previous image captioning work, we analyze techniques that yield significant performance improvements in image caption generation, including visual attention methods, the use of different types of semantic information in captions, and multi-caption generation techniques. Further, advancements such as neural architecture search, few-shot learning, multi-phase learning, and cross-modal embedding within image caption networks are examined for their effects. The comprehensive quantitative analysis conducted in this study identifies cutting-edge methodologies and sheds light on their impact on image captioning technology.


Pages: 2247-2280

Page count: 34
