A transformer based real-time photo captioning framework for visually impaired people with visual attention

Cited by: 0
Authors
Muhammed Kunju A.K. [1 ]
Baskar S. [2 ]
Zafar S. [3 ]
A R B. [4 ]
S R. [5 ]
A S.K. [2 ]
Affiliations
[1] Department of Electronics and Communication Engineering, Amal Jyothi College of Engineering (Autonomous), Kanjirapally, Kerala
[2] Faculty of Engineering, Department of Electronics and Communication Engineering, Karpagam Academy of Higher Education, Coimbatore
[3] Department of Computer and Electronics Engineering, SEST- Jamia Hamdard, New Delhi
[4] Department of Electronics and Communication Engineering, KMEA College of Engineering, Cochin, Kerala
[5] Department of Computer Science and Engineering, V.S.B College of Engineering Technical Campus, Coimbatore
Keywords
Computer vision; Natural language processing; Photo captioning; Two-layer transformer; Visually impaired
DOI
10.1007/s11042-024-18966-7
Abstract
In recent years, transformer-based photo captioning frameworks have played a crucial role in improving individuals’ overall well-being, self-reliance, and inclusivity by giving them access to visual content via written and voiced explanations. This research investigates a two-layer transformer architecture that effectively captures long-range relationships in visual data. Within this two-layer design, a visual attention mechanism enables the model to autonomously identify the relevant regions of a picture while generating each word of the predicted caption. Image features are obtained with a pre-trained Inception V3 model, with feature maps extracted from its final convolutional block. The datasets used in this research were Flickr8k and Flickr30k, along with a bespoke dataset. The model was trained on a TPU platform and deployed on a Raspberry Pi 4B single-board computer. To assess the efficacy of the two-layer transformer model for photo captioning, a series of experiments was undertaken on benchmark datasets against state-of-the-art models. The quality of the generated captions was evaluated using quantitative criteria such as BLEU, BLEURT, ROUGE, and F1 score. Furthermore, a series of qualitative assessments was conducted, employing human evaluators, to measure the overall coherence and fluency of the captions. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
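The paper's preprocessing code is not reproduced here; the following is a minimal sketch, assuming a standard Keras setup, of how feature maps can be taken from the final convolutional block of a pre-trained Inception V3 (the 8×8×2048 output for a 299×299 input) and flattened into 64 region vectors for the attention layers. The function name and file handling are illustrative, not the authors' code.

```python
import tensorflow as tf

# Pre-trained Inception V3 without the classification head; its output is the
# final convolutional block's feature map: (batch, 8, 8, 2048) for 299x299 inputs.
extractor = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")

def extract_region_features(image_path: str) -> tf.Tensor:
    """Return (1, 64, 2048): 64 spatial regions, each a 2048-d feature vector."""
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)  # scale to [-1, 1]
    fmap = extractor(img[tf.newaxis, ...])            # (1, 8, 8, 2048)
    return tf.reshape(fmap, (1, -1, fmap.shape[-1]))  # flatten the spatial grid
```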
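The abstract does not give the exact decoder configuration; a minimal sketch of a two-layer transformer decoder with visual (cross-) attention over the image regions might look like the following, where `d_model`, the head count, and the feed-forward width are hypothetical choices for illustration, and positional encodings are omitted for brevity.

```python
import tensorflow as tf

class TwoLayerCaptionDecoder(tf.keras.Model):
    """Two decoder layers: masked self-attention over caption tokens, then
    cross-attention ("visual attention") over the image-region features."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, d_model)
        self.proj = tf.keras.layers.Dense(d_model)  # 2048-d CNN features -> d_model
        self.blocks = [{
            "self_attn": tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads),
            "cross_attn": tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads),
            "ffn": tf.keras.Sequential([
                tf.keras.layers.Dense(4 * d_model, activation="relu"),
                tf.keras.layers.Dense(d_model)]),
            "norms": [tf.keras.layers.LayerNormalization() for _ in range(3)],
        } for _ in range(num_layers)]
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, tokens, region_feats):
        x = self.embed(tokens)        # (batch, seq_len, d_model)
        v = self.proj(region_feats)   # (batch, 64, d_model)
        for blk in self.blocks:
            n0, n1, n2 = blk["norms"]
            # Causal mask stops each position from attending to future words.
            x = n0(x + blk["self_attn"](x, x, use_causal_mask=True))
            # Visual attention: each word position attends over image regions.
            x = n1(x + blk["cross_attn"](x, v))
            x = n2(x + blk["ffn"](x))
        return self.out(x)            # per-position vocabulary logits
```

At inference, captions would be decoded autoregressively: feed the tokens generated so far together with the region features, pick the next word from the last position's logits (greedy or beam search), and repeat until an end token is produced.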
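Metrics such as BLEU are standard and can be computed with common tooling; as a small illustration (the tokenization and smoothing choices here are assumptions, not the paper's), corpus-level BLEU-4 with NLTK:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions per image, plus the model's hypothesis for it.
references = [[["a", "dog", "runs", "along", "the", "beach"],
               ["a", "brown", "dog", "running", "on", "sand"]]]
hypotheses = [["a", "dog", "running", "on", "the", "beach"]]

# Corpus BLEU-4 with smoothing, which avoids zero scores on short captions.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```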
Pages: 88859–88878
Page count: 19