Glyph Miner: A System for Efficiently Extracting Glyphs from Early Prints in the Context of OCR

Cited by: 3
Authors
Budig, Benedikt [1 ]
van Dijk, Thomas C. [1 ]
Kirchner, Felix [2 ]
Affiliations
[1] Univ Wurzburg, Chair Comp Sci 1, Wurzburg, Germany
[2] Univ Lib Wurzburg, Digitizat Ctr, Wurzburg, Germany
Source
2016 IEEE/ACM JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL) | 2016
Keywords
Early Prints; Document Recognition; OCR; Glyph Extraction; Efficient User Interaction;
DOI
10.1145/2910896.2910915
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
While off-the-shelf OCR systems work well on many modern documents, the heterogeneity of early prints provides a significant challenge. To achieve good recognition quality, existing software must be "trained" specifically to each particular corpus. This is a tedious process that involves significant user effort. In this paper we demonstrate a system that generically replaces a common part of the training pipeline with a more efficient workflow: Given a set of scanned pages of a historical document, our system uses an efficient user interaction to semi-automatically extract large numbers of occurrences of glyphs indicated by the user. In a preliminary case study, we evaluate the effectiveness of our approach by embedding our system into the workflow at the University Library Wurzburg.
Pages: 31-34
Page count: 4