Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text

Cited by: 49
Authors
Carrell, David [1]
Malin, Bradley [2, 3]
Aberdeen, John [4]
Bayer, Samuel [4]
Clark, Cheryl [4]
Wellner, Ben [4]
Hirschman, Lynette [4]
Affiliations
[1] Grp Hlth Res Inst, Seattle, WA 98101 USA
[2] Vanderbilt Univ, Dept Biomed Informat, Nashville, TN USA
[3] Vanderbilt Univ, Dept Elect Engn & Comp Sci, Nashville, TN USA
[4] Mitre Corp, Bedford, MA 01730 USA
Funding
US National Science Foundation; US National Institutes of Health;
Keywords
AUTOMATIC DE-IDENTIFICATION; OF-THE-ART; MEDICAL-RECORDS; DOCUMENTS;
DOI
10.1136/amiajnl-2012-001034
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Secondary use of clinical text is impeded by a lack of highly effective, low-cost de-identification methods. Both manual and automated methods for removing protected health information are known to leave behind residual identifiers. The authors propose a novel approach for addressing the residual identifier problem based on the theory of Hiding In Plain Sight (HIPS). Materials and Methods: HIPS relies on obfuscation to conceal residual identifiers. According to this theory, replacing the detected identifiers with realistic but synthetic surrogates should collectively render the few 'leaked' identifiers difficult to distinguish from the synthetic surrogates. The authors conducted a pilot study to test this theory on clinical narrative de-identified by an automated system. Test corpora included 31 oncology and 50 family practice progress notes read by two trained chart abstractors and an informaticist. Results: Experimental results suggest that approximately 90% of residual identifiers can be effectively concealed by the HIPS approach in text containing average and high densities of personal identifying information. Discussion: This pilot test suggests HIPS is feasible but requires further evaluation. The results need to be replicated on larger corpora of diverse origin under a range of detection scenarios. Error analyses also suggest areas where surrogate generation techniques can be refined to improve efficacy. Conclusions: If these results generalize to existing high-performing de-identification systems with recall rates of 94-98%, HIPS could increase the effective de-identification rates of these systems to levels above 99% without further advancements in system recall. Additional and more rigorous assessment of the HIPS approach is warranted.
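The arithmetic behind the "above 99%" figure in the Conclusions can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the effective rate is the system's recall plus the share of missed identifiers that HIPS conceals, and it treats the roughly 90% concealment rate from the pilot as a fixed constant. The function name `effective_deid_rate` and its parameters are hypothetical.

```python
def effective_deid_rate(recall: float, concealment: float = 0.90) -> float:
    """Fraction of identifiers that are either removed or concealed.

    recall      -- share of identifiers the de-identification system detects
    concealment -- share of the missed (residual) identifiers that readers
                   cannot tell apart from surrogates (assumed ~0.90, the
                   pilot-study estimate quoted in the abstract)
    """
    if not (0.0 <= recall <= 1.0 and 0.0 <= concealment <= 1.0):
        raise ValueError("recall and concealment must be in [0, 1]")
    residual = 1.0 - recall  # identifiers the system failed to detect
    return recall + residual * concealment


if __name__ == "__main__":
    # Recall range quoted in the abstract for high-performing systems.
    for recall in (0.94, 0.98):
        rate = effective_deid_rate(recall)
        print(f"recall={recall:.2f} -> effective rate={rate:.3f}")
```

Under these assumptions, recall of 94% gives an effective rate of about 99.4% and recall of 98% gives about 99.8%, consistent with the abstract's claim.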
Pages: 342-348
Page count: 7