Automatic De-Identification of French Clinical Records: Comparison of Rule-Based and Machine-Learning Approaches

Cited by: 14

Authors:

Grouin, Cyril [1]

Zweigenbaum, Pierre [1]

Affiliations:

[1] LIMSI CNRS, F-91400 Orsay, France

Source:

MEDINFO 2013: PROCEEDINGS OF THE 14TH WORLD CONGRESS ON MEDICAL AND HEALTH INFORMATICS, PTS 1 AND 2 | 2013 / Vol. 192

Keywords:

Information Protection; Natural Language Processing; Medical Records; AGREEMENT; DOCUMENTS;

DOI:

10.3233/978-1-61499-289-9-476

Chinese Library Classification number:

R19 [Health care organization and services (health services management)];

Abstract:

In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning system based on conditional random fields (CRF). Both systems were designed to process nine identifiers in a corpus of cardiology medical records. We performed two evaluations: the first on 62 cardiology documents, and the second on 10 foetopathology documents produced by optical character recognition (OCR), to evaluate the robustness of our systems. In cardiology, we achieved exact-match overall F-measures of 0.843 (rule-based) and 0.883 (machine-learning). While the rule-based system achieved good results on nominative data (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems were not designed for this corpus and despite OCR errors, we obtained promising results: exact-match overall F-measures of 0.681 (rule-based) and 0.638 (machine-learning). This demonstrates that existing tools can be applied to process new documents of lower quality.
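
The exact-match F-measure reported above can be illustrated with a minimal sketch. It assumes each identifier is represented as a (start offset, end offset, category) triple and that a prediction counts as correct only when all three values match a gold annotation. The representation, the function name `exact_match_f_measure`, and the sample annotations are hypothetical and are not taken from the paper, whose exact averaging scheme is not stated in the abstract.

```python
from typing import Set, Tuple

# Assumed representation of one identifier: (start_offset, end_offset, category).
Entity = Tuple[int, int, str]


def exact_match_f_measure(
    gold: Set[Entity], predicted: Set[Entity]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) counting only exact matches."""
    # A prediction is a true positive only if span and category both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical annotations for illustration only, not data from the paper.
gold = {
    (0, 6, "last_name"),
    (10, 20, "date"),
    (25, 30, "zip_code"),
    (40, 52, "hospital"),
}
predicted = {
    (0, 6, "last_name"),
    (10, 20, "date"),
    (25, 31, "zip_code"),  # boundary off by one character: not an exact match
}

precision, recall, f_measure = exact_match_f_measure(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In this sketch the zip code prediction is rejected because its end offset differs by one character, which shows why exact-match scoring is sensitive to boundary errors such as those OCR noise can introduce.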

Pages: 476-480

Page count: 5
