A real-time image captioning framework using computer vision to help the visually impaired

Cited by: 5
Authors
Safiya, K. M. [1 ]
Pandian, R. [2 ]
Affiliations
[1] Sathyabama Inst Sci & Technol Deemed to be Univ, Dept Comp Sci & Engn, Chennai, India
[2] Sathyabama Inst Sci & Technol Deemed to be Univ, Dept Elect & Commun Engn, Chennai, India
Keywords
Artificial intelligence; Computer vision; Text-to-speech; Image captioning; LSTM; VGG16; Visually impaired
DOI
10.1007/s11042-023-17849-7
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Advancements in image captioning technology have played a pivotal role in enhancing the quality of life for people with visual impairments, fostering greater social inclusivity. Computer vision and natural language processing methods enhance the accessibility and comprehensibility of pictures by adding textual descriptions. Significant advancements have been achieved in photo captioning tailored specifically for people with visual impairments. Nevertheless, some challenges remain, such as ensuring the precision of automatically generated captions and effectively handling pictures that contain many objects or settings. This research presents a groundbreaking architecture for real-time picture captioning using a VGG16-LSTM deep learning model with computer vision assistance. The framework has been developed and deployed on a Raspberry Pi 4B single-board computer with graphics processing unit capabilities. This implementation allows for the automated generation of relevant captions for photographs captured in real time by a NoIR camera module, making the system a portable and uncomplicated choice for people with visual impairments. The efficacy of the VGG16-LSTM deep learning model is evaluated via comprehensive testing involving both sighted and visually impaired participants in diverse settings. The experimental findings demonstrate that the proposed framework operates as intended, generating real-time picture captions that are accurate and contextually appropriate. The analysis of user feedback indicates a significant improvement in the understanding of visual content, facilitating the mobility and interaction of individuals with visual impairments in their environment. Multiple datasets, including Flickr8k, Flickr30k, VizWiz-Captions, and a custom dataset, were used for model training, validation, and testing.
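The abstract's encoder-decoder design (a CNN image encoder feeding an LSTM that emits a caption word by word) follows the standard greedy decoding loop. The sketch below shows only that loop; the VGG16 encoder and LSTM softmax are replaced by a toy next-word table so the code runs standalone, and all names (`START`, `END`, `TOY_NEXT`, `generate_caption`) are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of greedy caption decoding for an encoder-decoder
# captioner such as the VGG16-LSTM described above. The toy table stands
# in for "argmax over the LSTM's softmax given image features + prefix".

START, END, MAX_LEN = "<start>", "<end>", 10

# Toy "language model": maps the last emitted word to the most probable
# next word. A real model would condition on the image features too.
TOY_NEXT = {
    START: "a",
    "a": "dog",
    "dog": "on",
    "on": "the",
    "the": "grass",
    "grass": END,
}

def generate_caption(image_features, next_word=TOY_NEXT.get):
    """Greedy decoding: repeatedly append the most probable next word,
    stopping at the end token or after MAX_LEN words."""
    words = [START]
    for _ in range(MAX_LEN):
        w = next_word(words[-1])  # real model: argmax of LSTM softmax
        if w is None or w == END:
            break
        words.append(w)
    return " ".join(words[1:])   # drop the start token

print(generate_caption(image_features=None))  # → "a dog on the grass"
```

In a deployed system the loop body would call the trained model once per word, which is why on-device caption latency grows with caption length.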
During the training phase, the ResNet-50 and VGG-16 models achieve 80.84% and 84.13% accuracy, respectively; during the validation phase, they achieve 80.04% and 83.88%. The text-to-speech API is analyzed with the MOS and WER metrics, and its accuracy and performance are verified on a GPU system using a custom dataset. The efficacy of the VGG16-LSTM deep-learning model is evaluated using six metrics: accuracy, precision, recall, F1 score, BLEU, and ROUGE. Individuals with visual impairments may benefit from this deep learning architecture, as it endeavors to facilitate their comprehension of and engagement with visual content.
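Two of the metrics named in the abstract can be stated compactly: BLEU compares generated captions to references via clipped n-gram precision with a brevity penalty, and WER scores the text-to-speech output as word-level edit distance over reference length. A minimal, self-contained sketch (unigram BLEU only, single reference, no smoothing; not the paper's exact evaluation code):

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram BLEU: clipped unigram precision times brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped counts
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def wer(hypothesis, reference):
    """Word error rate: Levenshtein distance over words / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
          for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(bleu1("a dog on grass", "a dog on the grass"), 3))  # 0.779
print(wer("a dog on grass", "a dog on the grass"))              # 0.2
```

Full BLEU as reported in captioning papers averages precisions over 1- to 4-grams and typically uses multiple references per image; this sketch shows only the core computation.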
Pages: 59413-59438
Number of pages: 26
Related Papers
43 in total
[1] Abubeker, K. M.; Baskar, S. A Hand Hygiene Tracking System With LoRaWAN Network for the Abolition of Hospital-Acquired Infections [J]. IEEE SENSORS JOURNAL, 2023, 23(07): 7608-7615.
[2] Abubeker, K. M.; Baskar, S. B2-Net: an artificial intelligence powered machine learning framework for the classification of pneumonia in chest x-ray images [J]. MACHINE LEARNING-SCIENCE AND TECHNOLOGY, 2023, 4(01).
[3] Afzal, Muhammad Kashif. Journal of Ambient Intelligence and Humanized Computing, 2023, P7719. DOI 10.1007/s12652-023-04584-y.
[4] [Anonymous], 2020, Kaggle.
[5] [Anonymous], 2018, Kaggle.
[6] Ben Atitallah, Ahmed; Said, Yahia; Ben Atitallah, Mohamed Amin; Albekairi, Mohammed; Kaaniche, Khaled; Alanazi, Turki M.; Boubaker, Sahbi; Atri, Mohamed. Embedded implementation of an obstacle detection system for blind and visually impaired persons' assistance navigation [J]. COMPUTERS & ELECTRICAL ENGINEERING, 2023, 108.
[7] Chang, Jiacheng; Zhang, Lanyong; Shao, Zhuang. View-target relation-guided unsupervised 2D image-based 3D model retrieval via transformer [J]. MULTIMEDIA SYSTEMS, 2023, 29(06): 3891-3901.
[8] Chu, Yan; Yue, Xiao; Yu, Lei; Sergei, Mikhailov; Wang, Zhengkui. Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention [J]. WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2020, 2020.
[9] Das, Ringki; Singh, Thoudam Doren. Assamese news image caption generation using attention mechanism [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81(07): 10051-10069.
[10] de Freitas, Mauricio Pasetto; Piai, Vinicius Aquino; Farias, Ricardo Heffel; Fernandes, Anita M. R.; de Moraes Rossetto, Anubis Graciela; Quietinho Leithardt, Valderi Reis. Artificial Intelligence of Things Applied to Assistive Technology: A Systematic Literature Review [J]. SENSORS, 2022, 22(21).