Survey of Post-OCR Processing Approaches

Cited by: 60
Authors
Nguyen, Thi Tuyet Hai [1]
Jatowt, Adam [2]
Coustaty, Mickael [1]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Univ Innsbruck, Innsbruck, Austria
Funding
European Union Horizon 2020
Keywords
Post-OCR processing; OCR merging; error model; language model; machine learning; statistical and neural machine translation; ERROR-CORRECTION; ALIGNMENT; RECOGNITION;
DOI
10.1145/3453476
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines perform well on modern text, their performance drops significantly on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of improving the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies current trends and outlines some research directions in this field.
Pages: 37