Survey of Post-OCR Processing Approaches

Cited by: 60
Authors
Nguyen, Thi Tuyet Hai [1]
Jatowt, Adam [2]
Coustaty, Mickael [1]
Doucet, Antoine [1]
Affiliations
[1] Univ La Rochelle, L3i, La Rochelle, France
[2] Univ Innsbruck, Innsbruck, Austria
Funding
European Union Horizon 2020;
Keywords
Post-OCR processing; OCR merging; error model; language model; machine learning; statistical and neural machine translation; ERROR-CORRECTION; ALIGNMENT; RECOGNITION;
DOI
10.1145/3453476
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions of this field.
Pages: 37
Related papers
50 in total
  • [41] Geometric Algebra in Signal and Image Processing: A Survey
    Wang, Rui
    Wang, Kaili
    Cao, Wenming
    Wang, Xiangyang
    IEEE ACCESS, 2019, 7 : 156315 - 156325
  • [42] APSD: A Framework for Automated Processing of Survey Documents
    Yasmin, Farzana
    Hossain, Syed Mohammod Minhaz
    Arefin, Mohammad Shamsul
    2017 INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND COMMUNICATION ENGINEERING (ECCE), 2017, : 411 - 416
  • [43] A survey of machine learning for big data processing
    Qiu, Junfei
    Wu, Qihui
    Ding, Guoru
    Xu, Yuhua
    Feng, Shuo
    EURASIP JOURNAL ON ADVANCES IN SIGNAL PROCESSING, 2016,
  • [44] Computational intelligence in processing of speech acoustics: a survey
    Amitoj Singh
    Navkiran Kaur
    Vinay Kukreja
    Virender Kadyan
    Munish Kumar
    Complex & Intelligent Systems, 2022, 8 : 2623 - 2661
  • [45] A survey of machine learning for big data processing
    Junfei Qiu
    Qihui Wu
    Guoru Ding
    Yuhua Xu
    Shuo Feng
    EURASIP Journal on Advances in Signal Processing, 2016
  • [46] Source code authorship approaches natural language processing
    Petrik, Juraj
    Chuda, Daniela
    COMPUTER SYSTEMS AND TECHNOLOGIES (COMPSYSTECH'18), 2018, 1641 : 58 - 61
  • [47] Technical Approaches to Chinese Sign Language Processing: A Review
    Kamal, Suhail Muhammad
    Chen, Yidong
    Li, Shaozi
    Shi, Xiaodong
    Zheng, Jiangbin
    IEEE ACCESS, 2019, 7 : 96926 - 96935
  • [48] Computational intelligence in processing of speech acoustics: a survey
    Singh, Amitoj
    Kaur, Navkiran
    Kukreja, Vinay
    Kadyan, Virender
    Kumar, Munish
    COMPLEX & INTELLIGENT SYSTEMS, 2022, 8 (03) : 2623 - 2661
  • [49] Computer Vision and Image Processing Approaches for Corrosion Detection
    Ali, Ahmad Ali Imran Mohd
    Jamaludin, Shahrizan
    Imran, Md Mahadi Hasan
    Ayob, Ahmad Faisal Mohamad
    Ahmad, Sayyid Zainal Abidin Syed
    Akhbar, Mohd Faizal Ali
    Suhrab, Mohammed Ismail Russtam
    Ramli, Mohamad Riduan
    JOURNAL OF MARINE SCIENCE AND ENGINEERING, 2023, 11 (10)
  • [50] AI Approaches in Processing and Using Data in Personalized Medicine
    Ivanovic, Mirjana
    Autexier, Serge
    Kokkonidis, Miltiadis
    ADVANCES IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2022, 2022, 13389 : 11 - 24