Survey of Post-OCR Processing Approaches

Cited by: 60
Authors
Nguyen, Thi Tuyet Hai [1]
Jatowt, Adam [2]
Coustaty, Mickael [1]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Univ Innsbruck, Innsbruck, Austria
Funding
European Union Horizon 2020
Keywords
Post-OCR processing; OCR merging; error model; language model; machine learning; statistical and neural machine translation; ERROR-CORRECTION; ALIGNMENT; RECOGNITION
DOI
10.1145/3453476
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines perform well on modern text, their performance drops significantly on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions in this field.
Pages: 37