Survey of Post-OCR Processing Approaches

Cited by: 60
Authors
Nguyen, Thi Tuyet Hai [1]
Jatowt, Adam [2]
Coustaty, Mickael [1]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Univ Innsbruck, Innsbruck, Austria
Funding
European Union Horizon 2020;
Keywords
Post-OCR processing; OCR merging; error model; language model; machine learning; statistical and neural machine translation; ERROR-CORRECTION; ALIGNMENT; RECOGNITION;
DOI
10.1145/3453476
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions of this field.
Pages: 37