The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge

Cited by: 10

Authors:

Bui, Duy Duc An [1]

Wyatt, Mathew [1]

Cimino, James J. [1]

Affiliations:

[1] Univ Alabama Birmingham, Informat Inst, Birmingham, AL 35294 USA

Source:

JOURNAL OF BIOMEDICAL INFORMATICS | 2017 / Vol. 75

Keywords:

Automatic de-identification; Clinical natural language processing; Shared task; Machine learning

DOI:

10.1016/j.jbi.2017.05.001

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification code:

081203; 0835

Abstract:

Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered the gold standard approach but is tedious, expensive, slow, and impractical for large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a promising alternative. The Informatics Institute of the University of Alabama at Birmingham (UAB) is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. To gain experience developing our own automatic de-identification tool, we participated in the de-identification regular track of a shared task challenge organized by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID). We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules, term ambiguity measurement, and the use of a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (F-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (F-measures: 93.7%, 93.6%, and 93%). With these encouraging results, we gained the confidence to improve the tool and use it for the real de-identification task at the UAB Informatics Institute. © 2017 Elsevier Inc. All rights reserved.
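
The abstract reports two different measures without defining them. The sketch below illustrates one common reading of the distinction, which is an assumption rather than the challenge's official scoring code: a strict entity match requires identical start offset, end offset, and label, while a binary token match only asks whether each token was flagged as protected information at all. The `Span` type, the function names, the sample sentence, and the whitespace tokenizer are all hypothetical, and the HIPAA restriction to specific categories is not modeled.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation: character offsets plus a category label."""
    start: int
    end: int
    label: str


def f_measure(true_pos: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict_entity_f(gold: list[Span], pred: list[Span]) -> float:
    """A prediction counts only if start, end, and label all match exactly."""
    gold_set, pred_set = set(gold), set(pred)
    return f_measure(len(gold_set & pred_set), len(pred_set), len(gold_set))


def flagged_tokens(spans: list[Span], offsets: list[tuple[int, int]]) -> set[int]:
    """Indices of tokens that overlap any span, ignoring the span label."""
    return {
        i
        for i, (tok_start, tok_end) in enumerate(offsets)
        if any(tok_start < s.end and s.start < tok_end for s in spans)
    }


def binary_token_f(gold: list[Span], pred: list[Span],
                   offsets: list[tuple[int, int]]) -> float:
    """Token-level score: was each token flagged as protected or not?"""
    gold_tok = flagged_tokens(gold, offsets)
    pred_tok = flagged_tokens(pred, offsets)
    return f_measure(len(gold_tok & pred_tok), len(pred_tok), len(gold_tok))


# Made-up sentence; whitespace tokenization is a simplifying assumption.
text = "Seen by Dr. John Smith on 03/04/2016"
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

gold = [Span(12, 22, "DOCTOR"), Span(26, 36, "DATE")]
pred = [Span(17, 22, "DOCTOR"), Span(26, 36, "DATE")]  # first name missed

strict_score = strict_entity_f(gold, pred)
token_score = binary_token_f(gold, pred, offsets)
print(f"strict entity F: {strict_score:.2f}, binary token F: {token_score:.2f}")
```

In this toy case the partially detected name earns no credit under the strict measure but partial credit at the token level, which is one reason token-level scores tend to be higher than strict entity scores for the same system.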

Citation

Pages: S54-S61

Page count: 8
