KHATT: An open Arabic offline handwritten text database

Cited by: 130
Authors
Mahmoud, Sabri A. [1 ]
Ahmad, Irfan [1 ]
Al-Khatib, Wasfi G. [1 ]
Alshayeb, Mohammad [1 ]
Parvez, Mohammad Tanvir [2 ]
Maergner, Volker [3 ]
Fink, Gernot A. [4 ]
Affiliations
[1] King Fahd Univ Petr & Minerals, Dhahran 31261, Saudi Arabia
[2] Qassim Univ, Qasim 51477, Saudi Arabia
[3] Tech Univ Carolo Wilhelmina Braunschweig, D-38092 Braunschweig, Germany
[4] Tech Univ Dortmund, D-44227 Dortmund, Germany
Keywords
Arabic handwritten text database; Arabic OCR; Document analysis; Form processing; WORD RECOGNITION; DURATION; LANGUAGE; SYSTEM; HMMS;
DOI
10.1016/j.patcog.2013.08.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A comprehensive Arabic handwritten text database is an essential resource for Arabic handwritten text recognition research, especially given the lack of such a database for Arabic handwritten text. In this paper, we report our comprehensive Arabic offline Handwritten Text database (KHATT), consisting of 1000 handwritten forms written by 1000 distinct writers from different countries. The forms were scanned at 200, 300, and 600 dpi resolutions. The database contains 2000 randomly selected paragraphs from 46 sources, 2000 minimal text paragraphs covering all the shapes of Arabic characters, and optionally written paragraphs on open subjects. The 2000 random text paragraphs consist of 9327 lines. The database forms were randomly divided into 70%, 15%, and 15% sets for training, testing, and verification, respectively, which enables researchers to use the database and compare their results. A formal verification procedure is implemented to align the handwritten text with its ground truth at the form, paragraph, and line levels. The verified ground truth database contains metadata describing the written text at the page, paragraph, and line levels in text and XML formats. Tools to extract paragraphs from pages and segment paragraphs into lines have been developed. In addition, we present our experimental results on the database using two classifiers, viz. Hidden Markov Models (HMM) and our novel syntactic classifier. The database is made freely available to researchers worldwide for research on various handwriting-related problems such as text recognition, writer identification and verification, form analysis, preprocessing, and segmentation. Several international research groups and researchers have acquired the database for use in their research so far. (C) 2013 Elsevier Ltd. All rights reserved.
Pages: 1096-1112
Number of pages: 17