De-identification of clinical free text using natural language processing: A systematic review of current approaches

Cited by: 8
Authors
Kovacevic, Aleksandar [1]
Basaragin, Bojana [2]
Milosevic, Nikola [2,3]
Nenadic, Goran [4]
Affiliations
[1] Univ Novi Sad, Fac Tech Sci, Trg Dositeja Obradovica 6, Novi Sad 21002, Serbia
[2] Inst Artificial Intelligence Res & Dev Serbia, Fruskogorska 1, Novi Sad 21000, Serbia
[3] Bayer AG, Res & Dev, Mullerstr 173, D-13342 Berlin, Germany
[4] Univ Manchester, Dept Comp Sci, Manchester, England
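Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
de-identification; natural language processing; English clinical free text; PROTECTED HEALTH INFORMATION; RECORDS; NARRATIVES;
DOI
10.1016/j.artmed.2024.102845
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract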
Background: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable for sharing for research purposes. De-identification, i.e. the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
Objectives: Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performance and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field.
Methods: A systematic search in PubMed, Web of Science, and DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in depth, and information was collected on de-identification methodologies, data sources, and measured performance.
Results: A total of 2125 publications were identified for title and abstract screening, of which 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.
Conclusion: Earlier de-identification approaches aimed at English were mainly rule-based and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98%) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be assessed more thoroughly, with different measures, to judge how reliably they can de-identify data safely in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.
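For readers unfamiliar with the metric, the F1-score cited in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch, assuming strict (exact-match) entity-level scoring over (start, end, type) spans; the function name `strict_prf` and the toy annotations are hypothetical and are not taken from any of the reviewed studies.

```python
from typing import Set, Tuple

# A PHI annotation: (start offset, end offset, PHI type). Hypothetical layout.
Span = Tuple[int, int, str]


def strict_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return exact-match precision, recall and F1 over PHI spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (invented data): one gold span is missed, one prediction is spurious.
gold = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 62, "LOCATION"), (70, 80, "ID")}
predicted = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 62, "LOCATION"), (90, 95, "NAME")}

precision, recall, f1 = strict_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this sketch, recall is the privacy-relevant quantity, since every missed gold span is a piece of PHI left in the text. This is one reason a single aggregate F1 figure may not be sufficient to judge whether a system is safe to deploy.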
Pages: 31