Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes

Cited by: 56
Authors
Norgeot, Beau [1]
Muenzen, Kathleen [1]
Peterson, Thomas A. [1]
Fan, Xuancheng [1]
Glicksberg, Benjamin S. [1]
Schenk, Gundolf [1]
Rutenberg, Eugenia [1]
Oskotsky, Boris [1]
Sirota, Marina [1]
Yazdany, Jinoos [2]
Schmajuk, Gabriela [2, 3]
Ludwig, Dana [1]
Goldstein, Theodore [1]
Butte, Atul J. [1, 4]
Affiliations
[1] Univ Calif San Francisco, Bakar Computat Hlth Sci Inst, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Dept Med, Div Rheumatol, San Francisco, CA 94143 USA
[3] San Francisco VA Med Ctr, San Francisco, CA USA
[4] Univ Calif Hlth, Ctr Data Driven Insights & Innovat, Oakland, CA 94607 USA
Funding
National Institutes of Health (USA);
Keywords
IDENTIFICATION;
DOI
10.1038/s41746-020-0258-y
Chinese Library Classification number
R19 [Health care organizations and services (health services administration)];
Abstract
There is a great and growing need to ascertain the exact state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the limited data available in structured medical records. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, evaluation of the efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they were trained on medical text corpora that were too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed a customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.
Pages: 8