A Context-Enhanced De-identification System

Cited by: 2
Authors
Lee, Kahyun [1]
Kayaalp, Mehmet [2]
Henry, Sam [1]
Uzuner, Oezlem [1]
Affiliations
[1] George Mason Univ, 4400 Univ Dr, Fairfax, VA 22030 USA
[2] US Natl Lib Med, 8600 Rockville Pike, Bethesda, MD 20894 USA
Source
ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE | 2022, Vol. 3, No. 1
Funding
National Institutes of Health (USA)
Keywords
De-identification; HIPAA; entity recognition; information extraction; natural language processing; INFORMATION;
DOI
10.1145/3470980
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence. This approach prevents the systems from capturing dependencies over sentence boundaries and makes accurate sentence boundary detection a prerequisite. Since sentence boundary detection can be problematic especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. This new system incorporates context embeddings through forward and backward n-grams without using sentence boundaries. Our context-enhanced de-identification (CEDI) system captures dependencies over sentence boundaries and bypasses the sentence boundary detection problem altogether. We enhanced this system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (p < 0.01). All datasets comprise narrative clinical reports in English but contain different note types varying from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
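The abstract's idea of forward and backward n-gram context that ignores sentence boundaries can be illustrated with a minimal sketch. This is not the CEDI implementation: the function name `context_windows`, the window size `n`, and the `<pad>` token are hypothetical, and the sketch only shows how per-token context windows could be cut from an unsegmented token stream. How CEDI turns such n-grams into context embeddings is not specified here.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token for the document edges


def context_windows(
    tokens: List[str], n: int = 3
) -> List[Tuple[List[str], str, List[str]]]:
    """Return (backward n-gram, token, forward n-gram) for every token.

    Windows are cut from the whole document token stream, so they cross
    sentence boundaries freely and no sentence splitter is required.
    """
    padded = [PAD] * n + tokens + [PAD] * n
    windows = []
    for i, tok in enumerate(tokens):
        j = i + n  # position of tok inside the padded sequence
        backward = padded[j - n:j]          # n tokens before tok
        forward = padded[j + 1:j + 1 + n]   # n tokens after tok
        windows.append((backward, tok, forward))
    return windows


# Two "sentences" in one stream; no boundary detection is applied.
doc = "Seen by Dr. Smith . He ordered labs .".split()
windows = context_windows(doc, n=2)
```

In this toy input, the backward window of "He" contains "Smith" from the preceding sentence, which is the kind of cross-sentence dependency a sentence-by-sentence tagger would not see.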
Pages: 14