Large-scale genealogical information extraction from handwritten Quebec parish records

Cited by: 7
Authors
Tarride, Solene [1]
Maarand, Martin [1]
Boillet, Melodie [1, 2]
McGrath, James [3]
Capel, Eugenie [3]
Vezina, Helene [3]
Kermorvant, Christopher [1, 2]
Affiliations
[1] TEKLIA, Paris, France
[2] Normandie Univ, LITIS, Rouen, France
[3] Univ Quebec Chicoutimi, BALSAC Project, Saguenay, PQ, Canada
Keywords
Information extraction; Document layout analysis; Handwritten text recognition; Historical documents; Quebec parish records
DOI
10.1007/s10032-023-00427-w
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a complete workflow for extracting information from handwritten Quebec parish registers. The acts in these documents contain individual and family information that is highly valuable for genetic, demographic and social studies of the Quebec population. From an image of parish records, our workflow identifies the acts and extracts personal information. The workflow is divided into successive steps: page classification, text line detection, handwritten text recognition, named entity recognition, and act detection and classification. For each of these steps, different machine learning models are compared. Once the information is extracted, validation rules designed by experts are applied to standardize it and ensure its consistency with the type of act (birth, marriage or death). This validation step can reject records that are considered invalid or merged. The full workflow has been used to process over two million pages of Quebec parish registers from the 19th and 20th centuries. On a sample comprising 65% of registers, 3.2 million acts were recognized. Verification of the birth and death acts from this sample shows that 74% of them are considered complete and valid. These records will be integrated into the BALSAC database and linked together to recreate family and genealogical relations at large scale.
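As a rough illustration of the validation step described in the abstract, the sketch below checks whether an extracted act carries the fields expected for its type and computes the share of acts that pass. The abstract does not list the expert rules, so the `Act` class, the `REQUIRED_FIELDS` table, the field names, and both functions are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical required entities per act type. The real expert-designed
# rules are not given in the abstract; these are assumptions.
REQUIRED_FIELDS = {
    "birth": {"child_name", "father_name", "mother_name", "date"},
    "marriage": {"groom_name", "bride_name", "date"},
    "death": {"deceased_name", "date"},
}


@dataclass
class Act:
    """One extracted act: its predicted type and its named entities."""
    act_type: str
    entities: dict = field(default_factory=dict)


def validate(act: Act) -> bool:
    """Return True if the act type is known and every required field is non-empty."""
    required = REQUIRED_FIELDS.get(act.act_type)
    if required is None:
        return False  # unknown act type: reject
    return all(act.entities.get(name) for name in required)


def valid_fraction(acts: list) -> float:
    """Share of acts that pass validation (0.0 for an empty list)."""
    if not acts:
        return 0.0
    return sum(validate(a) for a in acts) / len(acts)
```

Under these assumptions, a death act with a name but no date would be rejected as incomplete, which mirrors the kind of completeness check behind the reported share of valid birth and death acts.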
Pages: 255-272
Page count: 18