Automatic De-Identification of French Clinical Records: Comparison of Rule-Based and Machine-Learning Approaches

Cited by: 14
Authors
Grouin, Cyril [1]
Zweigenbaum, Pierre [1]
Affiliation
[1] LIMSI CNRS, F-91400 Orsay, France
Source
MEDINFO 2013: PROCEEDINGS OF THE 14TH WORLD CONGRESS ON MEDICAL AND HEALTH INFORMATICS, PTS 1 AND 2 | 2013 / Vol. 192
Keywords
Information Protection; Natural Language Processing; Medical Records; AGREEMENT; DOCUMENTS;
DOI
10.3233/978-1-61499-289-9-476
Chinese Library Classification
R19 [Health care organizations and services (health services administration)];
Abstract
In this paper, we compare two approaches to automatically de-identifying medical records written in French: a rule-based system and a machine-learning system based on conditional random fields (CRF). Both systems were designed to process nine identifiers in a corpus of cardiology records. We performed two evaluations to assess the robustness of our systems: first on 62 cardiology documents, and second on 10 foetopathology documents produced by optical character recognition (OCR). In cardiology, we achieved an exact-match overall F-measure of 0.843 (rule-based) and 0.883 (machine-learning). While the rule-based system achieved good results on nominative data (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems were not designed for this corpus and despite OCR errors, we obtained promising results: an exact-match overall F-measure of 0.681 (rule-based) and 0.638 (machine-learning). This demonstrates that existing tools can be applied to process new documents of lower quality.
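The exact-match F-measure reported in the abstract can be illustrated with a minimal sketch. It assumes entities are represented as (start offset, end offset, category) triples and that a prediction counts as correct only when all three match a gold annotation. The representation, the category labels, and the function name `exact_match_f_measure` are hypothetical and are not taken from the paper, whose evaluation protocol may differ in detail.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start_offset, end_offset, category).
Entity = Tuple[int, int, str]


def exact_match_f_measure(
    gold: Set[Entity], predicted: Set[Entity]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact matching.

    A predicted entity is a true positive only if its boundaries and
    category are identical to those of a gold entity.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy annotations (invented for illustration, not data from the paper).
gold = {
    (0, 6, "last_name"),
    (10, 20, "date"),
    (25, 30, "zip_code"),
    (40, 52, "hospital"),
}
predicted = {
    (0, 6, "last_name"),
    (10, 20, "date"),
    (25, 31, "zip_code"),  # boundary off by one: not an exact match
}

precision, recall, f_measure = exact_match_f_measure(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In this toy case the off-by-one zip code boundary counts as both a false positive and a false negative, which is why exact-match scores are stricter than overlap-based ones.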
Pages: 476-480
Number of pages: 5