Improved Tesseract optical character recognition performance on Thai document datasets

Cited by: 0
Authors
Anakpluek, Noppol [1]
Pasanta, Watcharakorn [1]
Chantharasukha, Latthawan [1]
Chokratansombat, Pattanawong [1]
Kanjanakaew, Pajaya [1]
Siriborvornratanakul, Thitirat [1]
Affiliations
[1] Natl Inst Dev Adm, Grad Sch Appl Stat, 148 SeriThai Rd,Klong Chan, Bangkok 10240, Thailand
Keywords
Optical character recognition (OCR); Image processing; Thai language
DOI
10.1016/j.bdr.2025.100508
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This research aims to improve the accuracy and efficiency of Optical Character Recognition (OCR) technology for the Thai language, specifically in the context of Thai government documents. OCR converts text in images into a machine-readable format, facilitating document storage and further processing. However, applying OCR to the Thai language presents unique challenges due to its complexity. This study focuses on enhancing the performance of the Tesseract OCR engine, a widely used free OCR technology, by implementing various image preprocessing techniques such as masking, adaptive thresholding, median filtering, Canny edge detection, and morphological operators. A dataset of Thai documents is used, and the OCR system's output is evaluated using the word error rate (WER) and character error rate (CER) metrics. To improve text extraction accuracy, the research employs the original U-Net architecture [19] for image segmentation. Furthermore, the Tesseract OCR engine is fine-tuned, and image preprocessing is performed to optimize the accuracy of the OCR system. The developed tools automate workflow processes, alleviate constraints on model training, and enable the effective use of information from official Thai documents for various purposes.
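The abstract names WER and CER as the evaluation metrics but does not say how they were computed. The following is a minimal sketch of the standard definitions, assuming both are the Levenshtein edit distance divided by the reference length. The function names `edit_distance`, `cer`, and `wer` are hypothetical and not taken from the paper. The sketch also assumes that characters are counted per Unicode code point and that words are separated by whitespace. Thai text is normally written without spaces between words, so a real evaluation would need a Thai word tokenizer, which is not shown here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, counted per Unicode code point (an assumption)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate with whitespace tokenization (an assumption; Thai
    text would need a proper word segmenter before this step)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions take the ground-truth transcription first and the OCR output second. Because the denominator is the reference length, either rate can exceed 1.0 when the OCR output contains many spurious insertions.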
Pages: 8
Related papers
25 in total
  • [1] Burkpalli V., 2022, Int. Res. J. Moderniz. Eng. Technol. Sci., V4
  • [2] Chawla A., 2022, J. Emerg. Technol. Innov. Res. (JETIR), V9
  • [3] Lung computed tomography image segmentation based on U-Net network fused with dilated convolution
    Chen, Kuan-bing
    Xuan, Ying
    Lin, Ai-jun
    Guo, Shao-hua
    [J]. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2021, 207
  • [4] Dias C, 2019, 2019 2ND INTERNATIONAL CONFERENCE ON INTELLIGENT COMMUNICATION AND COMPUTATIONAL TECHNIQUES (ICCT), P79, DOI 10.1109/ICCT46177.2019.8969068
  • [5] Geetha V., 2022, J. Eng. Comput. Architec., V12
  • [6] Which OCR toolset is good and why? A comparative study
    Jain, Pooja
    Taneja, Kavita
    Taneja, Harmunish
    [J]. KUWAIT JOURNAL OF SCIENCE, 2021, 48 (02)
  • [7] Jaume G, 2019, arXiv, arXiv:1905.13538
  • [8] Kumar A., 2023, INT C INN DAT COMM T
  • [9] A novel stock counting system for detecting lot numbers using Tesseract OCR
    Lertsawatwicha P.
    Phathong P.
    Tantasanee N.
    Sarawutthinun K.
    Siriborvornratanakul T.
    [J]. International Journal of Information Technology, 2023, 15 (1) : 393 - 398
  • [10] Majeed A, 2024, arXiv, arXiv:2408.13631