KHATT: An open Arabic offline handwritten text database

Cited by: 130
Authors
Mahmoud, Sabri A. [1 ]
Ahmad, Irfan [1 ]
Al-Khatib, Wasfi G. [1 ]
Alshayeb, Mohammad [1 ]
Parvez, Mohammad Tanvir [2 ]
Maergner, Volker [3 ]
Fink, Gernot A. [4 ]
Affiliations
[1] King Fahd Univ Petr & Minerals, Dhahran 31261, Saudi Arabia
[2] Qassim Univ, Qasim 51477, Saudi Arabia
[3] Tech Univ Carolo Wilhelmina Braunschweig, D-38092 Braunschweig, Germany
[4] Tech Univ Dortmund, D-44227 Dortmund, Germany
Keywords
Arabic handwritten text database; Arabic OCR; Document analysis; Form processing; WORD RECOGNITION; DURATION; LANGUAGE; SYSTEM; HMMS;
DOI
10.1016/j.patcog.2013.08.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A comprehensive Arabic handwritten text database is an essential resource for Arabic handwritten text recognition research, especially given the lack of such a database for Arabic handwritten text. In this paper, we report our comprehensive Arabic offline Handwritten Text database (KHATT), consisting of 1000 handwritten forms written by 1000 distinct writers from different countries. The forms were scanned at 200, 300, and 600 dpi resolutions. The database contains 2000 randomly selected paragraphs from 46 sources, 2000 minimal-text paragraphs covering all the shapes of Arabic characters, and optionally written paragraphs on open subjects. The 2000 random text paragraphs consist of 9327 lines. The database forms were randomly divided into 70%, 15%, and 15% sets for training, testing, and verification, respectively. This enables researchers to use the database and compare their results. A formal verification procedure is implemented to align the handwritten text with its ground truth at the form, paragraph, and line levels. The verified ground truth database contains meta-data describing the written text at the page, paragraph, and line levels in text and XML formats. Tools to extract paragraphs from pages and segment paragraphs into lines are developed. In addition, we present our experimental results on the database using two classifiers, viz. Hidden Markov Models (HMM) and our novel syntactic classifier. The database is made freely available to researchers worldwide for research in various handwriting-related problems such as text recognition, writer identification and verification, forms analysis, preprocessing, and segmentation. Several international research groups and researchers have so far acquired the database for use in their research. © 2013 Elsevier Ltd. All rights reserved.
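The 70% / 15% / 15% random division of forms mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure or the official KHATT partition: the form IDs (1 to 1000), the function name `split_forms`, and the fixed seed are all assumptions made for illustration only.

```python
import random


def split_forms(form_ids, seed=0):
    """Shuffle form IDs and cut them into 70% / 15% / 15% subsets.

    Hypothetical helper: the abstract states only the proportions,
    not how the random division was actually carried out.
    """
    ids = list(form_ids)
    random.Random(seed).shuffle(ids)      # seeded so the split is repeatable
    n_train = len(ids) * 70 // 100        # integer arithmetic avoids float rounding
    n_test = len(ids) * 15 // 100
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    verification = ids[n_train + n_test:]  # remainder goes to verification
    return train, test, verification


# Assumed numbering: one ID per form, 1000 forms in total.
train, test, verification = split_forms(range(1, 1001))
print(len(train), len(test), len(verification))
```

Splitting by form rather than by line keeps all text from one writer in a single subset, which matches the abstract's statement that the forms themselves were divided.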
Pages: 1096-1112
Page count: 17