Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique

Cited by: 24
Authors
Akram, Qurat ul Ain [1]
Hussain, Sarmad [1]
Niazi, Aneeta [1]
Anjum, Umair [1]
Irfan, Faheem [1]
Affiliation
[1] Univ Engn & Technol, Al Khawarizmi Inst Comp Sci, Ctr Language Engn, Lahore, Pakistan
Source
2014 11th IAPR International Workshop on Document Analysis Systems (DAS 2014) | 2014
Keywords
Tesseract; Urdu; Nastalique; OCR
DOI
10.1109/DAS.2014.45
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is challenging. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style for the Urdu language, a highly complex and cursive writing style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% for font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71%, respectively. Recognition time for Nastalique document images also improves, from an average of 170 milliseconds (ms) to an average of 84 ms.
Pages: 191-195
Page count: 5