HAH manuscripts: A holistic paradigm for classifying and retrieving historical Arabic handwritten documents

Cited by: 19
Authors
Al Aghbari, Zaher [1]
Brook, Salama [1]
Affiliations
[1] University of Sharjah, Department of Computer Science, Sharjah, United Arab Emirates
Keywords
Historical Arabic handwriting; Data mining of Arabic documents; Word recognition; Segmentation of historical Arabic handwritten documents; Feature extraction of Arabic text; Character recognition
DOI
10.1016/j.eswa.2009.02.024
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Technologies for reading and searching digital documents have helped academic researchers; however, truly effective search engines for handwritten documents have not yet been developed. There is a growing need to access historical Arabic handwritten manuscripts (HAH manuscripts) that are stored in large archives; therefore, tools for automatically searching, indexing, classifying and retrieving HAH manuscripts are required. The peculiar characteristics of Arabic handwriting make developing such systems more challenging. This paper presents a novel holistic technique for classifying and retrieving HAH manuscripts. The classification of HAH manuscripts is performed in several steps. First, the image of the HAH manuscript is segmented into words, and each word is then segmented into its connected parts. Because adjacent connected parts of a single word overlap, we developed a stretching algorithm that increases the gap between them and thus improves their segmentation. Second, several structural and statistical features devised for Arabic text are extracted from these connected parts and combined to represent each word with one consolidated feature vector. Finally, a neural network is used to learn and classify the input vectors into word classes. These classes are then used to retrieve HAH manuscripts. Extracting the structural and statistical features from the individual connected parts, rather than from the whole word, improved the performance of the system significantly. (C) 2009 Elsevier Ltd. All rights reserved.
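The two middle steps of the abstract (splitting a word image into connected parts, then combining per-part features into one word vector) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the 8-connectivity choice, the right-to-left ordering of parts, and the two toy features (aspect ratio and pixel density) are not taken from the paper, and the paper's stretching algorithm and neural-network classifier are not reproduced.

```python
from collections import deque


def connected_parts(image):
    """Return the 8-connected foreground regions of a binary image.

    `image` is a list of rows containing 0 (background) or 1 (ink).
    Parts are ordered right to left, an assumption matching Arabic
    writing direction.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    parts = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            parts.append(pixels)
    parts.sort(key=lambda p: max(x for _, x in p), reverse=True)
    return parts


def part_features(pixels):
    """Two illustrative features of one part: aspect ratio and ink density."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return [width / height, len(pixels) / (width * height)]


def word_vector(image, max_parts=4):
    """Concatenate per-part features into one fixed-length word vector.

    Words with fewer than `max_parts` parts are zero-padded; extra
    parts are dropped (a simplification for this sketch).
    """
    vector = []
    for pixels in connected_parts(image)[:max_parts]:
        vector.extend(part_features(pixels))
    vector.extend([0.0] * (2 * max_parts - len(vector)))
    return vector


# A toy "word" with two separate parts: a 2x2 block and a 3-pixel stroke.
word = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(len(connected_parts(word)))
print(word_vector(word, max_parts=3))
```

A fixed-length vector of this kind is what a classifier such as a neural network would take as input; the zero-padding is one simple way to handle words with differing numbers of parts.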
Pages: 10942-10951
Number of pages: 10