Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique

Cited by: 24
Authors
Akram, Qurat ul Ain [1]
Hussain, Sarmad [1]
Niazi, Aneeta [1]
Anjum, Umair [1]
Irfan, Faheem [1]
Affiliation
[1] Univ Engn & Technol, Al Khawarizmi Inst Comp Sci, Ctr Language Engn, Lahore, Pakistan
Source
2014 11th IAPR International Workshop on Document Analysis Systems (DAS 2014) | 2014
Keywords
Tesseract; Urdu; Nastalique; OCR
DOI
10.1109/DAS.2014.45
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is challenging. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style of Urdu, a highly complex, cursive style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% at font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71%. The average time for the recognition of Nastalique document images also improves from 170 milliseconds (ms) to 84 ms.
Pages: 191-195
Page count: 5