A novel Arabic OCR post-processing using rule-based and word context techniques

Cited by: 0
Authors
Iyad Abu Doush
Faisal Alkhateeb
Anwaar Hamdi Gharaibeh
Affiliations
[1] American University of Kuwait, Computer Science and Information Systems Department
[2] Yarmouk University, Computer Sciences Department
Source
International Journal on Document Analysis and Recognition (IJDAR) | 2018 / Vol. 21
Keywords
Automatic post-processing; Arabic OCR post-processing; Language model; Alignment technique; Error model
DOI
Not available
Abstract
Optical character recognition (OCR) is the process of automatically recognizing characters in scanned documents so that the text can be edited, indexed, and searched, and so that storage space is reduced. The text produced by OCR usually does not match the text in the original document exactly. OCR post-processing approaches can be used to minimize the number of incorrect words in the output. Correcting OCR errors is more complicated for Arabic because of properties of the script: letters are connected, different letters may share the same shape, and the same letter may take different forms. This paper provides a statistical Arabic language model and post-processing techniques that hybridize the error model approach with the context approach. The proposed model is language independent and is not constrained by string length. To the best of our knowledge, this is the first end-to-end OCR post-processing model applied to the Arabic language. To train the proposed model, we built an Arabic OCR context database containing 9000 images of Arabic text. The evaluation of the OCR post-processing results is also automated using our novel alignment technique, called fast automatic hashing text alignment. Our experimental results show that the rule-based system reduces the word error rate from 24.02% to 20.26% on a training data set of 1000 images. After this training, we applied the rule-based system to a testing data set of 500 images, where the word error rate dropped from 14.95% to 14.53%. The proposed hybrid OCR post-processing system, using the same 1000 training images, reduces the word error rate from 24.02% to 18.96%. After training the hybrid system, we used 500 images for testing, and the word error rate dropped from 14.95% to 14.42%. These results show that the proposed hybrid system outperforms the rule-based system.
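The abstract reports its results as word error rates. As a rough illustration of that metric, the sketch below computes a word error rate from a word-level Levenshtein distance between a reference text and an OCR output. This is an assumption made for illustration only: the paper's own alignment method (fast automatic hashing text alignment) is not described in this record, and the function names word_edit_distance, word_error_rate, and relative_reduction are hypothetical.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions).

    A standard stand-in for illustration; not the paper's alignment technique.
    """
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy strings, not data from the paper.
    print(word_error_rate("the quick brown fox", "the quick brwn fox"))
    # Relative reduction for the hybrid system's reported training figures.
    print(relative_reduction(24.02, 18.96))
```

Applying relative_reduction to the figures in the abstract gives a relative drop of roughly 15.7% for the rule-based system and 21.1% for the hybrid system on the training images, and roughly 2.8% and 3.5% respectively on the testing images. These percentages are derived here by arithmetic and are not stated in the abstract.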
Pages: 77-89
Page count: 12