Automated deidentification of radiology reports combining transformer and "hide in plain sight" rule-based methods

Cited by: 23
Authors
Chambon, Pierre J. [1 ,2 ,5 ]
Wu, Christopher [3 ]
Steinkamp, Jackson M. [3 ]
Adleberg, Jason [4 ]
Cook, Tessa S. [3 ]
Langlotz, Curtis P. [1 ]
Affiliations
[1] Stanford Univ, Dept Radiol, Stanford, CA USA
[2] Paris Saclay Univ, Ecole Cent Paris, Dept Appl Math & Engn, Paris, France
[3] Univ Penn, Dept Radiol, Philadelphia, PA USA
[4] Mt Sinai Hlth Syst, Dept Radiol, New York, NY USA
[5] Stanford Univ, Dept Radiol, 300 Pasteur Dr, Stanford, CA 94305 USA
Funding
US National Institutes of Health;
Keywords
deidentification; radiology; machine learning; NLP; transformer; DE-IDENTIFICATION;
DOI
10.1093/jamia/ocac219
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: To develop an automated deidentification pipeline for radiology reports that detects protected health information (PHI) entities and replaces them with realistic surrogates, "hiding in plain sight."
Materials and Methods: In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as the i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches, and data augmentation techniques, along with a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall, and F1 score, as well as paired-samples Wilcoxon tests.
Results: Our best PHI detection model achieves an F1 score of 97.9 on radiology reports from a known institution, 99.6 on reports from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves a recall of 99.1 for detecting the core of each PHI span.
Discussion: Our model outperforms all deidentifiers it was compared to on all test sets, as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports.
Conclusions: A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.
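Illustrative sketch (not from the paper): the "hiding in plain sight" replacement step described in the abstract could look roughly like the Python below. It assumes PHI spans have already been detected, which the paper does with a transformer model; here they are supplied by hand as character offsets. The surrogate pools, the label names, and the function `hide_in_plain_sight` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical surrogate pools, keyed by PHI label (assumption for this sketch).
# A real system would draw from much larger, more varied lists.
SURROGATES = {
    "PATIENT": ["Alex Morgan", "Jamie Lee", "Sam Carter"],
    "DATE": ["03/14/2020", "11/02/2019", "07/21/2020"],
    "HOSPITAL": ["Lakeside Medical Center", "Northgate Hospital"],
}


def hide_in_plain_sight(text, spans, rng):
    """Replace each (start, end, label) span with a surrogate of the same label.

    Spans are character offsets with an exclusive end and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # keep non-PHI text unchanged
        pieces.append(rng.choice(SURROGATES[label]))    # swap PHI for a surrogate
        cursor = end
    pieces.append(text[cursor:])                        # trailing non-PHI text
    return "".join(pieces)


report = "Patient John Smith was imaged on 01/05/2020 at Stanford Hospital."
# Hand-written spans standing in for the output of a PHI detection model.
spans = [(8, 18, "PATIENT"), (33, 43, "DATE"), (47, 64, "HOSPITAL")]

deidentified = hide_in_plain_sight(report, spans, random.Random(0))
print(deidentified)
```

Because detected PHI is replaced by plausible values rather than by a placeholder tag, any PHI the detector misses is harder for a reader to tell apart from the surrogates, which is the rationale behind the "hiding in plain sight" name.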
Pages: 318-328
Page count: 11
Related papers
23 in total
[1]   The MITRE Identification Scrubber Toolkit: Design, training, and assessment [J].
Aberdeen, John ;
Bayer, Samuel ;
Yeniterzi, Reyyan ;
Wellner, Ben ;
Clark, Cheryl ;
Hanauer, David ;
Malin, Bradley ;
Hirschman, Lynette .
INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2010, 79 (12) :849-859
[2]  
Bergstra J.S., 2011, ADV NEURAL INFORM PR
[3]   Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text [J].
Carrell, David ;
Malin, Bradley ;
Aberdeen, John ;
Bayer, Samuel ;
Clark, Cheryl ;
Wellner, Ben ;
Hirschman, Lynette .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2013, 20 (02) :342-348
[4]   The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight [J].
Carrell, David S. ;
Cronkite, David J. ;
Li, Muqun ;
Nyemba, Steve ;
Malin, Bradley A. ;
Aberdeen, John S. ;
Hirschman, Lynette .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2019, 26 (12) :1536-1544
[5]   Improved Fine-Tuning of In-Domain Transformer Model for Inferring COVID-19 Presence in Multi-Institutional Radiology Reports [J].
Chambon, Pierre ;
Cook, Tessa S. ;
Langlotz, Curtis P. .
JOURNAL OF DIGITAL IMAGING, 2023, 36 (01) :164-177
[6]   De-identification of patient notes with recurrent neural networks [J].
Dernoncourt, Franck ;
Lee, Ji Young ;
Uzuner, Ozlem ;
Szolovits, Peter .
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2017, 24 (03) :596-606
[7]  
Dernoncourt Franck., 2017, P 2017 C EMP METH NA, P97, DOI 10.18653/V1/D17-2017
[8]   An integrated framework for de-identifying unstructured medical data [J].
Gardner, James ;
Xiong, Li .
DATA & KNOWLEDGE ENGINEERING, 2009, 68 (12) :1441-1451
[9]   Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [J].
Gu, Yu ;
Tinn, Robert ;
Cheng, Hao ;
Lucas, Michael ;
Usuyama, Naoto ;
Liu, Xiaodong ;
Naumann, Tristan ;
Gao, Jianfeng ;
Poon, Hoifung .
ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE, 2022, 3 (01)
[10]  
Howard J. M., ARXIV