ICDAR2017 Competition on Post-OCR Text Correction

Cited by: 26
Authors
Chiron, Guillaume [1 ]
Doucet, Antoine [2 ]
Coustaty, Mickael [2 ]
Moreux, Jean-Philippe [1 ]
Affiliations
[1] Natl Lib France, F-75706 Paris, France
[2] Univ La Rochelle, Lab L3i, Av Michel Crepeau, F-17000 La Rochelle, France
Source
2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1 | 2017
DOI
10.1109/ICDAR.2017.232
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper describes the ICDAR2017 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for more than 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two independent tasks: 1) error detection and 2) error correction. An original dataset of 12M OCR-ed symbols, along with an aligned ground truth, was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. The dataset aggregates several sources, notably newspapers and monographs, in two languages (English and French). Eleven teams submitted results, and the difficulty of the task was underlined by the fact that only half of the submitted methods were able to denoise the evaluation dataset on average. In any case, this competition, which counted 35 registrations, illustrates the strong interest of the community in this essential problem, which is key to any digitization process involving textual data.
Pages: 1423-1428
Number of pages: 6
Related papers
50 in total
  • [21] Survey of Post-OCR Processing Approaches
    Thi Tuyet Hai Nguyen
    Jatowt, Adam
    Coustaty, Mickael
    Doucet, Antoine
    ACM COMPUTING SURVEYS, 2021, 54 (06)
  • [22] ICDAR2017 Robust Reading Challenge on Text Extraction from Biomedical Literature Figures (DeTEXT)
    Yang, Chun
    Yin, Xu-Cheng
    Yu, Hong
    Karatzas, Dimosthenis
    Cao, Yu
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 1444 - 1447
  • [23] ICDAR2017 Robust Reading Challenge on Omnidirectional Video
    Iwamura, Masakazu
    Morimoto, Naoyuki
    Tainaka, Keishi
    Bazazian, Dena
    Gomez, Lluis
    Karatzas, Dimosthenis
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 1448 - 1453
  • [24] Lexicographical-based Order for Post-OCR Correction of Named Entities
    Jean-Caurant, Axel
    Tamani, Nouredine
    Courboulay, Vincent
    Burie, Jean-Christophe
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 1192 - 1197
  • [25] Unsupervised Multi-View Post-OCR Error Correction With Language Models
    Gupta, Harsh
    Del Corro, Luciano
    Broscheit, Samuel
    Hoffart, Johannes
    Brenner, Eliot
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 8647 - 8652
  • [26] Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing
    Thi-Tuyet-Hai Nguyen
    Jatowt, Adam
    Coustaty, Mickael
    Nhu-Van Nguyen
    Doucet, Antoine
    2019 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2019), 2019, : 29 - 38
  • [27] Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models
    Ramirez-Orta, Juan
    Xamena, Eduardo
    Maguitman, Ana
    Milios, Evangelos
    Soto, Axel J.
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 11192 - 11199
  • [28] Post-OCR Correction with OpenAI's GPT Models on Challenging English Prosody Texts
    Zhang, James
    Haverals, Wouter
    Naydan, Mary
    Kernighan, Brian W.
    PROCEEDINGS OF THE 2024 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, DOCENG 2024, 2024,
  • [29] ICDAR 2024 Competition on Multi Font Group Recognition and OCR
    van der Loop, Janne
    Kordon, Florian
    Mayr, Martin
    Christlein, Vincent
    Wu, Fei
    Rodriguez-Salas, Dalia
    Weichselbaumer, Nikolaus
    Seuret, Mathias
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT VI, 2024, 14809 : 381 - 396
  • [30] Post-OCR Paragraph Recognition by Graph Convolutional Networks
    Wang, Renshen
    Fujii, Yasuhisa
    Popat, Ashok C.
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2533 - 2542