Multilingual Character Segmentation and Recognition Schemes for Indian Document Images

Cited by: 45
Authors
Sahare, Parul [1]
Dhok, Sanjay B. [1]
Affiliations
[1] Visvesvaraya Natl Inst Technol, Dept Elect & Commun Engn, Ctr VLSI & Nanotechnol, Nagpur 440010, Maharashtra, India
Keywords
Character recognition; character segmentation; document analysis; graph theory; multilingual Indian optical character recognition; OCR system; scripts; transform; English
DOI
10.1109/ACCESS.2018.2795104
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, robust algorithms for character segmentation and recognition are presented for multilingual Indian document images in Latin and Devanagari scripts. These documents are generally affected by their layout organization, local skew, and low print quality, and they contain intermixed machine-printed and handwritten text. In the proposed character segmentation algorithm, primary segmentation paths are obtained using structural properties of the characters, whereas overlapped and joined characters are separated using graph distance theory. Finally, the segmentation results are validated using a highly accurate support vector machine classifier. For the proposed character recognition algorithm, three new geometrical shape-based features are computed. The first and second features are formed with respect to the center pixel of the character, whereas neighborhood information of the text pixels is used to calculate the third feature. To recognize the input character, a k-Nearest Neighbor classifier is used, as it has intrinsically zero training time. Comprehensive experiments are carried out on different databases containing printed as well as handwritten text. Benchmarking results show that the proposed algorithms outperform other contemporary approaches, with highest segmentation and recognition rates of 98.86% and 99.84%, respectively.
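The abstract's remark that a k-Nearest Neighbor classifier has intrinsically zero training time can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_predict`, the 3-dimensional feature vectors, and the labels are hypothetical placeholders standing in for the three shape-based features, whose exact definitions the abstract does not give.

```python
from collections import Counter
import math


def knn_predict(train_features, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    if not train_features or len(train_features) != len(train_labels):
        raise ValueError("training features and labels must be non-empty and aligned")
    # There is no fitting step: "training" is just keeping the labelled vectors.
    # All computation (distance to every stored vector) happens at query time.
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_features, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical feature vectors (assumed values, not the paper's features).
train_features = [
    (0.10, 0.20, 0.10),
    (0.15, 0.25, 0.05),
    (0.90, 0.80, 0.70),
    (0.85, 0.75, 0.80),
]
train_labels = ["A", "A", "क", "क"]

print(knn_predict(train_features, train_labels, (0.12, 0.22, 0.08), k=3))
```

The trade-off suggested by this sketch is that the cost avoided at training time is paid at query time, since every prediction compares the input against all stored samples.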
Pages: 10603-10617
Number of pages: 15