Ocropodium: open source OCR for small-scale historical archives

Cited by: 11
Authors
Blanke, Tobias [1]
Bryant, Michael [1]
Hedges, Mark [1]
Affiliations
[1] King's College London, Centre for e-Research, London WC2B 5RL, England
Keywords
historical archives; open source; optical character recognition; workflow
DOI
10.1177/0165551511429418
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale digitization projects dealing with text-based historical material face challenges that are not well catered for by commercial software. This article discusses the results of a project to build a scalable OCR workflow for historical collections based on open source tools that is particularly tailored towards use in small-scale historical archives. It argues that open source tools allow for better customization to match these requirements, particularly with regard to character model training and per-project language modelling. We offer insights into our accuracy evaluation results of various open source OCR tools, as well as a case study about the challenges and opportunities of open source OCR in historical archives.
Pages: 76-86
Page count: 11