Improved Tesseract optical character recognition performance on Thai document datasets

Cited by: 0

Authors:

Anakpluek, Noppol [1]

Pasanta, Watcharakorn [1]

Chantharasukha, Latthawan [1]

Chokratansombat, Pattanawong [1]

Kanjanakaew, Pajaya [1]

Siriborvornratanakul, Thitirat [1]

Affiliation:

[1] National Institute of Development Administration, Graduate School of Applied Statistics, 148 SeriThai Rd, Klong Chan, Bangkok 10240, Thailand

Source:

BIG DATA RESEARCH | 2025 / Volume 39

Keywords:

Optical character recognition (OCR); Image processing; Thai language

DOI:

10.1016/j.bdr.2025.100508

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This research aims to improve the accuracy and efficiency of Optical Character Recognition (OCR) technology for the Thai language, specifically in the context of Thai government documents. OCR enables the conversion of text from images into machine-readable format, facilitating document storage and further processing. However, applying OCR to the Thai language presents unique challenges due to its complexity. This study focuses on enhancing the performance of the Tesseract OCR engine, a widely used free OCR technology, by implementing various image preprocessing techniques such as masking, adaptive thresholding, median filtering, Canny edge detection, and morphological operators. A dataset of Thai documents is utilized, and the OCR system's output is evaluated using word error rate (WER) and character error rate (CER) metrics. To improve text extraction accuracy, the research employs the original U-Net architecture [19] for image segmentation. Furthermore, the Tesseract OCR engine is fine-tuned, and image preprocessing is performed to optimize OCR system accuracy. The developed tools automate workflow processes, alleviate constraints on model training, and enable the effective utilization of information from official Thai documents for various purposes.
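
The abstract names WER and CER as the evaluation metrics but does not say how they were computed. As an illustration only, the sketch below uses the common definition: edit (Levenshtein) distance between reference and OCR output, divided by the reference length. The function names are hypothetical and are not taken from the paper. Two assumptions matter for Thai: the script does not put spaces between words, so the word lists passed to wer() are assumed to come from a separate tokenizer, and cer() counts Unicode code points, so combining vowels and tone marks each count as one character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over Unicode code points (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference_words, hypothesis_words):
    """Word error rate over pre-tokenized word lists (assumed definition)."""
    if not reference_words:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference_words, hypothesis_words) / len(reference_words)
```

For example, cer("kitten", "sitting") divides an edit distance of 3 by a reference length of 6. Both rates can exceed 1.0 when the OCR output contains many spurious insertions.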

References

Pages: 8

Total: 25 entries

[1] Burkpalli V., 2022, Int. Res. J. Moderniz. Eng. Technol. Sci., V4
[2] Chawla A., 2022, J. Emerg. Technol. Innov. Res. (JETIR), V9
[3] Chen, Kuan-bing; Xuan, Ying; Lin, Ai-jun; Guo, Shao-hua. Lung computed tomography image segmentation based on U-Net network fused with dilated convolution [J]. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, 2021, 207
[4] Dias C, 2019, 2019 2ND INTERNATIONAL CONFERENCE ON INTELLIGENT COMMUNICATION AND COMPUTATIONAL TECHNIQUES (ICCT), P79, DOI 10.1109/ICCT46177.2019.8969068
[5] Geetha V., 2022, J. Eng. Comput. Architec., V12
[6] Jain, Pooja; Taneja, Kavita; Taneja, Harmunish. Which OCR toolset is good and why? A comparative study [J]. KUWAIT JOURNAL OF SCIENCE, 2021, 48 (2)
[7] Jaume G, 2019, arXiv, DOI arXiv:1905.13538
[8] Kumar A., 2023, INT C INN DAT COMM T
[9] Lertsawatwicha P.; Phathong P.; Tantasanee N.; Sarawutthinun K.; Siriborvornratanakul T. A novel stock counting system for detecting lot numbers using Tesseract OCR [J]. International Journal of Information Technology, 2023, 15 (1): 393-398
[10] Majeed A, 2024, arXiv, DOI arXiv:2408.13631
