Are Searches in OCR-generated Archives Trustworthy? An Analysis of Digital Newspaper Archives

Cited by: 2
Author
Burchardt, Jorgen [1]
Affiliation
[1] Nyborgvej 13, DK-5750 Ringe, Denmark
Source
Jahrbuch für Wirtschaftsgeschichte | 2023, Vol. 64, No. 1
Keywords
optical character recognition; historical archive; source criticism; research methodology; Historische Archive; Quellenkritik; Forschungsmethodik; OCR;
DOI
10.1515/jbwg-2023-0003
Chinese Library Classification
F [Economics]
Discipline classification code
02
Abstract
Digitised archives are revolutionary tools for research that, in a few seconds, generate results that earlier often took years to obtain. But do they provide all results for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text, and far higher error rates for advertisements. Such high error rates encourage a critical look at the 20-year-old sector. Although these errors can be reduced by re-digitisation and with new, improved OCR engines and new search algorithms, searches will nevertheless return manipulated results. In response, and to identify infringed bias and skewed representation, database owners need to provide thorough metadata to ensure source criticism.
Pages: 31-54
Page count: 24