Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Cited by: 30
Authors
Nguyen, Thi-Tuyet-Hai [1]
Jatowt, Adam [2]
Coustaty, Mickael [1]
Nguyen, Nhu-Van [1]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Kyoto Univ, Grad Sch Informat, Kyoto, Japan
Source
2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019) | 2019
Keywords
OCR errors; OCR post-processing; post-OCR text correction
DOI
10.1109/JCDL.2019.00015
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Post-OCR processing is an important step that follows optical character recognition (OCR) and aims to improve the quality of OCR documents by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors in four document collections. Five aspects of general OCR errors are studied and compared with human-generated misspellings: edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundaries. Based on the observations from the analysis, we give several suggestions for the design and implementation of effective OCR post-processing approaches.
Pages: 29-38
Page count: 10