OMNIDOCUMENT TECHNOLOGIES

Cited by: 67
Author
BOKSER, M
Affiliation
[1] Calera Recognition Systems, Inc., Sunnyvale, CA
Keywords
TEXT RECOGNITION; OCR; OMNIFONT; MULTIFONT; POLYFONT; FEATURE EXTRACTION; CLASSIFICATION
DOI
10.1109/5.156470
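Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809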
Abstract
With recent technical advances, OCR is now a viable technology for a wide range of applications. Calera's OCR engine is omnifont and reasonably robust on individual degraded characters. The weakest link is its handling of characters that are difficult to segment, such as characters that are joined to adjacent characters. The engine is divided into four phases: segmentation, image recognition, ambiguity resolution, and document analysis. The features are zonal and reduce the image to a blurred, gray-level representation. The classifier is data-driven, trained off-line, and model-free. We found that handcrafted features and decision trees tend to be brittle in the presence of noise. To satisfy the needs of full-text applications, the system captures the structure of the document so that, when viewed in a word processor or spreadsheet program, the formatting of the OCR'd document reflects the formatting of the original document. To satisfy the needs of the forms market, a proofing and correction tool displays "pop-up" images of uncertain characters.
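As an illustration of what "zonal" features that reduce an image to a blurred, gray-level representation might look like, here is a minimal sketch. It is not Calera's method: the abstract does not give the zone layout or the blurring scheme, so the function name `zonal_features`, the grid parameters, and the use of per-zone ink density are all assumptions made for the example.

```python
def zonal_features(image, zones_y=4, zones_x=4):
    """Reduce a binary glyph bitmap to a zones_y x zones_x grid of gray levels.

    Hypothetical helper, not taken from the paper.
    image: list of equal-length rows, each pixel 0 (background) or 1 (ink).
    Each output value is the fraction of ink pixels in that zone (0.0 to 1.0),
    which acts as a coarse, blurred gray-level summary of the glyph.
    """
    height = len(image)
    width = len(image[0])
    features = []
    for zy in range(zones_y):
        # Integer zone boundaries along the vertical axis.
        y0 = zy * height // zones_y
        y1 = (zy + 1) * height // zones_y
        row = []
        for zx in range(zones_x):
            # Integer zone boundaries along the horizontal axis.
            x0 = zx * width // zones_x
            x1 = (zx + 1) * width // zones_x
            pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # An empty zone (image smaller than the grid) contributes 0.0.
            row.append(sum(pixels) / len(pixels) if pixels else 0.0)
        features.append(row)
    return features


# Toy 4x4 bitmap reduced to a 2x2 grid of ink densities.
glyph = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(zonal_features(glyph, zones_y=2, zones_x=2))  # [[0.75, 0.0], [0.0, 1.0]]
```

Averaging over zones discards exact pixel positions, so a few flipped pixels change each feature only slightly. That is one plausible reason a representation of this kind could tolerate noise better than handcrafted features, though the abstract itself does not spell out the mechanism.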
Pages: 1066-1078
Page count: 13