Character recognition system for pegon typed manuscript

Cited by: 0
Authors
Ruldeviyani, Yova [1]
Suhartanto, Heru [1]
Sotardodo, Beltsazar Anugrah [1]
Fahreza, Muhammad Hanif [1]
Septiano, Andre [1]
Rachmadi, Muhammad Febrian [1]
Affiliations
[1] Univ Indonesia, Fac Comp Sci, Depok 16424, Jawa Barat, Indonesia
Keywords
Arabic; Deep learning; Character recognition; Pegon; Segmentation; Text line detection; LabelMe
DOI
10.1016/j.heliyon.2024.e35959
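Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract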
The Pegon script is an Arabic-based writing system used for the Javanese, Sundanese, Madurese, and Indonesian languages. For various reasons, this script is now found mainly among collectors and in private Islamic boarding schools (pesantren), creating a need for its preservation. One preservation method is digitization through transcription into machine-encoded text, known as optical character recognition (OCR). No published literature exists on OCR systems for this specific script. This research explores OCR of typed Pegon manuscripts and introduces novel synthesized and real annotated datasets for this task. These datasets are used to evaluate the proposed OCR methods, especially those adapted from existing Arabic OCR systems. Results show that deep learning techniques outperform conventional ones, which fail to detect Pegon text. The proposed system uses YOLOv5 for line segmentation and a CTC-CRNN architecture for line text recognition, achieving an F1-score of 0.94 for segmentation and a character error rate (CER) of 0.03 for recognition.
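The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: CER as the Levenshtein edit distance between the reference and recognized text divided by the reference length, and F1 as the harmonic mean of precision and recall. The abstract does not state how detected lines are matched to ground truth or how CER is aggregated across lines, so the function names and the sample inputs below are hypothetical and are not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from counts of true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    print(cer("abcd", "abxd"))
    print(f1_score(tp=8, fp=2, fn=2))
```

Under these definitions, a CER of 0.03 corresponds to roughly three character edits per hundred reference characters.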
Pages: 18