Multilingual character recognition dataset for Moroccan official documents

Cited by: 0

Authors:

Benaissa, Ali ^{[1,3]}

Bahri, Abdelkhalak ^{[1]}

El Allaoui, Ahmad ^{[2]}

Affiliations:

[1] Abdelmalek Essaadi Univ UAE, Data Sci & Competit Intelligence Team DSCI, ENSAH, Tetouan, Morocco

[2] Moulay Ismail Univ Meknes, Fac Sci & Tech Errachidia, Engn Sci & Tech Lab, Decis Comp & Syst Modelling Team, Meknes, Morocco

[3] Abdelmalek Essaadi Univ, Natl Sch Management, Governance & Performance Org Lab, Finance & Governance Org Team, Tangier, Morocco

Source:

DATA IN BRIEF | 2024 / Vol. 52

Keywords:

OCR dataset; Character recognition; Printed characters; Documents digitization; Moroccan documents; Moroccan characters images;

DOI:

10.1016/j.dib.2023.109953

Chinese Library Classification (CLC) number:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Subject classification code:

07 ; 0710 ; 09 ;

Abstract:

This article focuses on the construction of a dataset for multilingual character recognition in Moroccan official documents. The dataset covers languages such as Arabic, French, and Tamazight and is built programmatically to ensure data diversity. It consists of sub-datasets such as Uppercase alphabet (26 classes), Lowercase alphabet (26 classes), Digits (9 classes), Arabic (28 classes), Tifinagh letters (33 classes), Symbols (14 classes), and French special characters (16 classes). The dataset construction process involves collecting representative fonts and generating multiple character images using a Python script, presenting a comprehensive variety essential for robust recognition models. Moreover, this dataset contributes to the digitization of these diverse official documents and archival papers, essential for preserving cultural heritage and enabling advanced text recognition technologies. The need for this work arises from the advancements in character recognition techniques and the significance of large-scale annotated datasets. The proposed dataset contributes to the development of robust character recognition models for practical applications.
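
As a rough illustration of the sub-dataset structure listed in the abstract, the sketch below tallies the stated class counts and assigns each sub-dataset a starting index in a single combined label space. The class counts come from the abstract; the combined indexing scheme, the dictionary keys, and the `label_offsets` helper are assumptions made for illustration and are not described in the source.

```python
# Class counts per sub-dataset, as listed in the abstract.
# The key names are hypothetical; only the counts come from the source.
SUBSET_CLASSES = {
    "uppercase": 26,
    "lowercase": 26,
    "digits": 9,
    "arabic": 28,
    "tifinagh": 33,
    "symbols": 14,
    "french_special": 16,
}


def label_offsets(subsets):
    """Return (offsets, total): the first combined label index of each
    sub-dataset, and the total number of classes across all of them.

    Assumption: sub-datasets are concatenated in dictionary order into
    one label space. The source does not say whether the authors do this.
    """
    offsets = {}
    start = 0
    for name, n_classes in subsets.items():
        offsets[name] = start
        start += n_classes
    return offsets, start


offsets, total_classes = label_offsets(SUBSET_CLASSES)
print("total classes:", total_classes)
print("first Tifinagh label index:", offsets["tifinagh"])
```

Under this assumed layout, a classifier trained on all sub-datasets jointly would need one output per class in the combined space, while a per-script model would use only its own sub-dataset's count.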


Pages: 7