Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique

Cited by: 24

Authors:

Akram, Qurat ul Ain [1]

Hussain, Sarmad [1]

Niazi, Aneeta [1]

Anjum, Umair [1]

Irfan, Faheem [1]

Affiliations:

[1] Univ Engn & Technol, Al Khawarizmi Inst Comp Sci, Ctr Language Engn, Lahore, Pakistan

Source:

2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014) | 2014

Keywords:

Tesseract; Urdu; Nastalique; OCR;

DOI:

10.1109/DAS.2014.45

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is challenging. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style for the Urdu language, a highly complex, cursive style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% for font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71%. Recognition time for Nastalique document images also improves, from an average of 170 milliseconds (ms) to an average of 84 ms.
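
The figures reported in the abstract can be put into relative terms with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and it assumes that error rate is simply 100% minus the reported accuracy, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract: accuracy in percent, keyed by font size.
ACCURACY = {
    14: {"original": 65.59, "modified": 97.87},
    16: {"original": 65.84, "modified": 97.71},
}
# Average recognition time per document image, in milliseconds.
TIME_MS = {"original": 170.0, "modified": 84.0}


def accuracy_gain_points(original, modified):
    """Absolute accuracy gain in percentage points."""
    return modified - original


def relative_error_reduction(original, modified):
    """Fraction of errors removed, assuming error = 100 - accuracy."""
    err_before = 100.0 - original
    err_after = 100.0 - modified
    return (err_before - err_after) / err_before


def speedup(before_ms, after_ms):
    """How many times faster the modified system is on average."""
    return before_ms / after_ms


for size, acc in ACCURACY.items():
    gain = accuracy_gain_points(acc["original"], acc["modified"])
    reduction = relative_error_reduction(acc["original"], acc["modified"])
    print(f"font size {size}: +{gain:.2f} points, "
          f"{reduction:.1%} fewer errors")

print(f"speedup: {speedup(TIME_MS['original'], TIME_MS['modified']):.2f}x")
```

Under that assumption, the modified system removes roughly 93% of the original errors at both font sizes and runs about twice as fast on average.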

Citation

Pages: 191-195

Page count: 5