Evaluating the state-of-the-art in automatic de-identification

Cited by: 275
Authors
Uzuner, Oezlem
Luo, Yuan
Szolovits, Peter
Affiliations
[1] SUNY Albany, Albany, NY 12222 USA
[2] MIT, CSAIL PS, Cambridge, MA USA
DOI
10.1197/jamia.M2444
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
To facilitate and survey studies in automatic de-identification, as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, the authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of the received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. The authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance against the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
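The token-level evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the challenge's actual scoring code: the function name `token_level_scores`, the label strings, and the use of the balanced F-measure (equal weight on precision and recall) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Sequence, Tuple


def token_level_scores(
    gold: Sequence[str], predicted: Sequence[str], category: str
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one PHI category.

    `gold` and `predicted` hold one label per token, aligned by position.
    The balanced F-measure is an assumption; the abstract does not state
    which weighting the challenge used.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: tokens labeled with the category in both sequences.
    tp = sum(1 for g, p in pairs if g == category and p == category)
    # False positives: predicted as the category, but gold disagrees.
    fp = sum(1 for g, p in pairs if p == category and g != category)
    # False negatives: gold has the category, but the prediction missed it.
    fn = sum(1 for g, p in pairs if g == category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical labels: "O" marks a non-PHI token.
gold = ["PATIENT", "O", "DOCTOR", "O", "PATIENT"]
predicted = ["PATIENT", "PATIENT", "DOCTOR", "O", "O"]
print(token_level_scores(gold, predicted, "PATIENT"))
```

An instance-level variant would compare whole PHI spans rather than individual tokens, so a multi-token name counts once and only when every token in it is labeled correctly.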
Pages: 550-563
Number of pages: 14