Line and Ligature Segmentation of Urdu Nastaleeq Text

Cited by: 18
Authors
Ahmad, Ibrar [1]
Wang, Xiaojie [1]
Li, Ruifan [1]
Ahmed, Manzoor [2]
Ullah, Rahat [3]
Affiliations
[1] Beijing Univ Posts & Telecommun, Ctr Intelligence Sci & Technol, Beijing 100876, Peoples R China
[2] Tsinghua Univ, Dept Elect Engn, FIB Lab, Beijing 100084, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Nanjing 210044, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Urdu text segmentation; Nastaleeq script segmentation; line and ligature segmentation; preprocessing; optical character recognition
DOI
10.1109/ACCESS.2017.2703155
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The recognition accuracy of ligature-based Urdu optical character recognition (OCR) systems depends heavily on the accuracy of the segmentation step that splits Urdu text into lines and ligatures. In general, line- and ligature-based Urdu OCR systems are more successful than character-based ones. This paper presents techniques for segmenting Urdu Nastaleeq text images into lines and subsequently into ligatures. The classical horizontal projection-based segmentation method is augmented with a curved-line-split algorithm to overcome problems such as text line split position, overlapping, merged ligatures, and ligatures crossing line split positions. The ligature segmentation algorithm extracts connected components from text lines, categorizes them into primary and secondary classes, and allocates secondary components to the primary class by examining width, height, coordinates, overlapping, centroids, and baseline information. The proposed line segmentation algorithm is tested on 47 pages with 99.17% accuracy. The proposed ligature segmentation algorithm is mainly tested on a large data set of Urdu printed text images, which it segmented into 189 000 ligatures from 10 063 text lines containing 332 000 connected components. About 142 000 secondary components were successfully allocated to more than 189 000 primary ligatures with an accuracy rate of 99.80%. Thus, both proposed segmentation algorithms outperform the existing algorithms employed for Urdu Nastaleeq text segmentation. Moreover, the proposed line segmentation algorithm was also tested on Arabic text, for which it also extracted lines correctly.
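For orientation, the classical horizontal projection step that the abstract takes as its starting point can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes a binarized page stored as a NumPy array with 1 for ink and 0 for background, the names `segment_lines` and `min_ink` are hypothetical, and the curved-line-split extension described in the paper is not reproduced here.

```python
import numpy as np


def segment_lines(binary, min_ink=1):
    """Split a binarized page into text-line row ranges.

    binary  : 2-D array, 1 = ink pixel, 0 = background (assumed convention).
    min_ink : hypothetical threshold; a row counts as text when it holds
              at least this many ink pixels.
    Returns a list of (top, bottom) row indices, bottom exclusive.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = binary.sum(axis=1)
    is_text = profile >= min_ink

    # Pad with False on both ends so that runs touching the page border
    # still produce a start and an end transition.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])

    # Transitions alternate: run start, run end (exclusive), run start, ...
    return [(int(top), int(bottom))
            for top, bottom in zip(changes[0::2], changes[1::2])]


# Tiny synthetic page: two bands of ink separated by blank rows.
page = np.zeros((8, 5), dtype=int)
page[1:3, 1:4] = 1   # first text line occupies rows 1-2
page[5:7, 0:5] = 1   # second text line occupies rows 5-6
lines = segment_lines(page)
```

A plain profile like this only works when blank rows separate the lines. The overlapping and line-crossing ligatures mentioned in the abstract are the cases where it breaks down, which is the gap the curved-line-split algorithm is said to address.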
Pages: 10924-10940
Page count: 17