Word-based correction for retrieval of Arabic OCR degraded documents

Cited by: 0
Authors
Magdy, Walid [1]
Darwish, Kareem [1]
Affiliations
[1] IBM Corp, Technol Dev Ctr, Giza, Egypt
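Source
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2006 / Vol. 4209
Keywords
OCR; retrieval; error correction
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract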
Arabic documents that are available only in print continue to be ubiquitous and they can be scanned and subsequently OCR'ed to ease their retrieval. This paper explores the effect of word-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. The OCR correction uses an improved character segment based noisy channel model and is tested on real and synthetic OCR degradation. Results show that the effect of OCR correction depends on the length of the index term used and that indexing using short n-grams is perhaps superior to word-based error correction. The results are potentially applicable to other languages.
Pages: 205-216
Number of pages: 12
Related papers
30 in total
[1]
Abu-Salem H, 1999, J AM SOC INFORM SCI, V50, P524, DOI 10.1002/(SICI)1097-4571(1999)50:6<524::AID-ASI7>3.0.CO;2-M
[2]
AGIRRE EK, 1998, COLING ACL 98
[3]
Ahmed Mohamed Attia, 2000, THESIS CAIRO U CAIRO
[4]
ALJLAYL, 2001, TREC 2001 GAITH MD
[5]
ALKHARASHI IA, 1994, J AM SOC INFORM SCI, V45, P548, DOI 10.1002/(SICI)1097-4571(199409)45:8<548::AID-ASI3>3.0.CO;2-X
[6]
BAEZAYATES R, 1996, SPRINGER VERLAG LNCS
[7]
DARWISH K, 2002, TREC 200I GAITH
[8]
DARWISH K, 2003, SIGIR 2003