The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge

Cited by: 10
Authors
Bui, Duy Duc An [1]
Wyatt, Mathew [1]
Cimino, James J. [1]
Affiliations
[1] Univ Alabama Birmingham, Informat Inst, Birmingham, AL 35294 USA
Keywords
Automatic de-identification; Clinical natural language processing; Shared task; Machine learning
DOI
10.1016/j.jbi.2017.05.001
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered the gold standard approach but is tedious, expensive, slow, and impractical for large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a potentially promising alternative. The Informatics Institute of the University of Alabama at Birmingham (UAB) is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. To gain experience developing our own automatic de-identification tool, we participated in the de-identification regular track of the shared task challenge organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID). We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules and term ambiguity measurement, and used a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). With those encouraging results, we gained the confidence to improve and use the tool for the real de-identification task at the UAB Informatics Institute. © 2017 Elsevier Inc. All rights reserved.
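The abstract reports two measures, strict entity and binary token, without defining them. The sketch below illustrates one common reading of the difference: strict entity counts a prediction only when its span and type match a gold annotation exactly, while binary token only asks whether each token was flagged as protected health information (PHI) at all. The function names, the `(start, end, type)` annotation tuples, and the token offsets are assumptions made for illustration, not the challenge's official scorer, and the sketch does not apply the HIPAA category filter that the "binary token HIPAA" measure implies.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical annotation format: (start_char, end_char, phi_type).
Entity = Tuple[int, int, str]
Token = Tuple[int, int]  # (start_char, end_char)


def f_measure(true_pos: int, n_pred: int, n_gold: int) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if true_pos == 0 or n_pred == 0 or n_gold == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def strict_entity_f(gold: Set[Entity], pred: Set[Entity]) -> float:
    """A prediction counts only if its span AND type match exactly."""
    return f_measure(len(gold & pred), len(pred), len(gold))


def phi_token_ids(entities: Iterable[Entity], tokens: List[Token]) -> Set[int]:
    """Indices of tokens that overlap any annotated span, ignoring type."""
    entities = list(entities)
    return {
        i
        for i, (t_start, t_end) in enumerate(tokens)
        if any(t_start < e_end and e_start < t_end for e_start, e_end, _ in entities)
    }


def binary_token_f(gold: Set[Entity], pred: Set[Entity], tokens: List[Token]) -> float:
    """Each token is scored as PHI or not PHI; span boundaries and types are ignored."""
    gold_ids = phi_token_ids(gold, tokens)
    pred_ids = phi_token_ids(pred, tokens)
    return f_measure(len(gold_ids & pred_ids), len(pred_ids), len(gold_ids))


if __name__ == "__main__":
    # Made-up sentence: "Seen by Dr John Smith on 2016-05-01"
    tokens = [(0, 4), (5, 7), (8, 10), (11, 15), (16, 21), (22, 24), (25, 35)]
    gold = {(11, 21, "DOCTOR"), (25, 35, "DATE")}
    # The hypothetical system tagged only "Smith" instead of "John Smith".
    pred = {(16, 21, "DOCTOR"), (25, 35, "DATE")}
    print(f"strict entity F: {strict_entity_f(gold, pred):.3f}")
    print(f"binary token F:  {binary_token_f(gold, pred, tokens):.3f}")
```

In this made-up example the partially matched name earns no credit under strict entity but partial credit at the token level, which is one reason token-level scores tend to be higher than entity-level scores for the same system.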
Pages: S54-S61
Page count: 8