Combining knowledge- and data-driven methods for de-identification of clinical narratives

Cited by: 38
Authors
Dehghan, Azad [1, 2]
Kovacevic, Aleksandar [3]
Karystianis, George [1, 2]
Keane, John A. [1, 4]
Nenadic, Goran [1, 4, 5]
Affiliations
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
[2] Christie NHS Fdn Trust, Manchester, Lancs, England
[3] Univ Novi Sad, Fac Tech Sci, Novi Sad, Serbia
[4] Univ Manchester, Manchester Inst Biotechnol, Manchester, Lancs, England
[5] Farr Inst Hlth Informat Res, Hlth eRes Ctr, London, England
Keywords
De-identification; Named entity recognition; Information extraction; Clinical text mining; Electronic health record
DOI
10.1016/j.jbi.2015.06.029
Chinese Library Classification number
TP39 [Computer applications]
Subject classification code
081203; 0835
Abstract
The recent promise of large-scale access to unstructured clinical data from electronic health records has revitalized interest in automated de-identification of clinical notes, which includes identifying mentions of Protected Health Information (PHI). We describe the methods developed and evaluated as part of the i2b2/UTHealth 2014 challenge to identify PHI defined by 25 entity types in longitudinal clinical narratives. Our approach combines knowledge-driven (dictionaries and rules) and data-driven (machine learning) methods with a large range of features to address de-identification of specific named entities. In addition, we have devised a two-pass recognition approach that creates a patient-specific run-time dictionary from the PHI entities identified with high confidence in the first pass, which is then used in the second pass to identify mentions that lack specific clues. The proposed method achieved overall micro F1-measures of 91% on strict and 95% on token-level evaluation on the test dataset (514 narratives). Whilst most PHI entities can be reliably identified, mentions of Organizations and Professions were particularly challenging. Still, the overall results suggest that automated text mining methods can be used to reliably process clinical notes to identify personal information, thus providing a crucial step in large-scale de-identification of unstructured data for further clinical and epidemiological studies. (C) 2015 Elsevier Inc. All rights reserved.
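The two-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` type, the single title-based rule in `first_pass`, the confidence values, and the 0.9 threshold are all hypothetical stand-ins for the paper's dictionaries, rules, and machine-learning taggers, which the abstract does not specify.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    text: str
    label: str
    confidence: float


def first_pass(text):
    """Hypothetical first pass: one clue-based rule (a title before a surname).

    The real system combines rules, dictionaries, and ML models; this rule
    only stands in for "a recognizer that relies on local clues".
    """
    pattern = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)")
    return [
        Mention(m.start(1), m.end(1), m.group(1), "NAME", 0.95)
        for m in pattern.finditer(text)
    ]


def build_runtime_dictionary(mentions, threshold=0.9):
    """Keep only high-confidence mentions (threshold is an assumed value)."""
    return {(m.text, m.label) for m in mentions if m.confidence >= threshold}


def second_pass(text, dictionary, found):
    """Tag occurrences of dictionary entries that the first pass missed."""
    covered = {(m.start, m.end) for m in found}
    extra = []
    for surface, label in dictionary:
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            if (m.start(), m.end()) not in covered:
                extra.append(Mention(m.start(), m.end(), surface, label, 0.5))
    return sorted(found + extra, key=lambda m: m.start)


def deidentify_patient(notes):
    """Run both passes over all notes belonging to one patient."""
    first = {i: first_pass(note) for i, note in enumerate(notes)}
    dictionary = build_runtime_dictionary(
        [m for mentions in first.values() for m in mentions]
    )
    return [second_pass(note, dictionary, first[i]) for i, note in enumerate(notes)]


notes = ["Seen by Dr. Smith today.", "Smith recommends follow-up."]
result = deidentify_patient(notes)
```

In this toy input the second note contains "Smith" without a title, so the clue-based rule misses it; the patient-specific dictionary built from the first note is what recovers it in the second pass.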
Pages: S53-S59
Page count: 7