De-identifying free text of Japanese electronic health records

Cited by: 6
Authors
Kajiyama, Kohei [1]
Horiguchi, Hiromasa [2]
Okumura, Takashi [3]
Morita, Mizuki [4]
Kano, Yoshinobu [1]
Affiliations
[1] Shizuoka Univ, Fac Informat, Naka Ku, Johoku 3-5-1, Hamamatsu, Shizuoka 4328011, Japan
[2] Natl Hosp Org Headquarters, Meguro Ku, 2-5-21 Higashigaoka, Tokyo 1528621, Japan
[3] Natl Univ Corp Kitami Inst Technol, 165 Koencho, Kitami, Hokkaido 0908507, Japan
[4] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Kita Ku, Okayama 7008558, Japan
Keywords
De-identification; Electronic health records; Japanese language; IDENTIFICATION; FRENCH
DOI
10.1186/s13326-020-00227-9
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Background: More electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.
Results: Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset of fictitious records written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method performed best on every dataset except the MedNLP dataset, for which the rule-based method was best. Even there, the LSTM-based method achieved a good score of 83.07 points, only 1.16 points below the best score, obtained using the rule-based method. These results suggest that LSTM adapted well to the different characteristics of our datasets. On our Pathology Report dataset, our LSTM-based method outperformed our CRF-based method by 7.41 F1 points. This is the first study to apply this LSTM-based method to any de-identification task on Japanese EHRs.
Conclusions: Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than our rule-based methods. However, machine learning methods are inadequate for processing expressions that occur rarely. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.
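The abstract reports results as F1-scores in points. As a reminder of how such a score is typically computed, the following is a minimal sketch, not the authors' evaluation code. It assumes entity-level scoring in which a predicted entity counts as correct only when its character span and tag both match the gold standard exactly; the abstract does not state the matching criterion. The `Entity` type, the `f1_score` function, and the sample spans are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, tag).
# Tag names follow the abstract: age, hospital, person, sex, time.
Entity = Tuple[int, int, str]


def f1_score(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Entity-level F1 in points (0-100), counting exact span-and-tag matches only."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


# Made-up example: four gold entities, three predictions, two exact matches.
# The predicted hospital span ends one character early, so it does not count.
gold = {(0, 3, "age"), (5, 7, "sex"), (10, 16, "hospital"), (20, 24, "time")}
predicted = {(0, 3, "age"), (5, 7, "sex"), (10, 15, "hospital")}
score = f1_score(gold, predicted)
```

Under this assumption a near-miss on a span boundary is penalized twice, once as a false positive and once as a false negative, which is one reason rarely occurring expressions can pull the score down.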
Pages: 12