Post-OCR text correction for Bulgarian historical documents

Cited by: 0

Authors:

Beshirov, Angel ^{[1]}

Dobreva, Milena ^{[1]}

Dimitrov, Dimitar ^{[1]}

Hardalov, Momchil ^{[2]}

Koychev, Ivan ^{[1]}

Nakov, Preslav ^{[3]}

Affiliations:

[1] Sofia Univ St Kliment Ohridski, FMI, Sofia, Bulgaria

[2] AWS AI Labs, Barcelona, Spain

[3] Mohamed Bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates

Source:

INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES | 2025 / Vol. 26 / Issue 01

Keywords:

Post-OCR text correction; Synthetic data; Orthographic variety; LLMs; Character-level sequence-to-sequence model;

DOI:

10.1007/s00799-025-00415-x

Chinese Library Classification (CLC) number:

G25 [Library science, librarianship]; G35 [Information science, information work];

Discipline classification code:

1205 ; 120501 ;

Abstract:

The digitization of historical documents is crucial for preserving society's cultural heritage. An essential step in this process is converting scanned images to text using Optical Character Recognition (OCR), which enables further search, information extraction, etc. Unfortunately, this is a challenging problem, as standard OCR tools are not tailored to historical orthography or difficult layouts. It is therefore standard to apply an additional text correction step to the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating post-OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition. It improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.
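
Note: the abstract does not define how the improvement percentage is computed. The sketch below assumes one common choice for post-OCR correction, the relative reduction in character-level Levenshtein distance to the ground truth. The function names `levenshtein` and `relative_improvement` and the sample strings are hypothetical and are not taken from the authors' released code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def relative_improvement(ocr_text: str, corrected_text: str, gold_text: str) -> float:
    """Percentage reduction in edit distance to the gold text after correction.

    Assumed metric for illustration; negative values mean the correction
    step made the text worse.
    """
    before = levenshtein(ocr_text, gold_text)
    after = levenshtein(corrected_text, gold_text)
    if before == 0:
        return 0.0  # nothing to correct
    return 100.0 * (before - after) / before


# Made-up toy strings: the gold word carries a word-final "ъ",
# as in pre-reform Bulgarian spelling.
gold = "народъ"
ocr = "нароль"        # two character errors
corrected = "народь"  # one error remains
print(relative_improvement(ocr, corrected, gold))  # 50.0
```

Under this assumed definition, a corpus-level score would sum the distances over all documents before taking the ratio, so long documents weigh more than short ones.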

Pages: 11