An hybrid Machine Learning method for the de-identification of Un-Structured Narrative Clinical Text in Multi-Center Chinese Electronic Medical Records Data

Cited by: 1
Authors
Jin, Meng [1, 2]
Zhang, Kai [3 ]
Yang, Yunhaonan [4 ]
Xie, Shuanglian [5 ]
Song, Kai [6 ]
Hu, Yonghua [1, 4]
Bao, Xiaoyuan [1, 2]
Affiliations
[1] Peking Univ, Med Informat Ctr, Beijing, Peoples R China
[2] Natl Med Serv Data Ctr, Beijing, Peoples R China
[3] Peking Univ, Hlth Sci Ctr, Beijing, Peoples R China
[4] Peking Univ, Sch Publ Hlth, Beijing, Peoples R China
[5] Peking Univ, Clin Med Coll 5, Beijing, Peoples R China
[6] China Japan Friendship Hosp, Beijing, Peoples R China
Source
2019 10TH IEEE INTERNATIONAL CONFERENCE ON BIG KNOWLEDGE (ICBK 2019) | 2019
Keywords
component; Chinese electronic medical record; Un-structured; machine learning; corpora; multi-center; OF-THE-ART; ANONYMIZATION; INFORMATION;
DOI
10.1109/ICBK.2019.00023
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Making full use of unstructured electronic medical records requires fully protecting patients' information privacy. Identifying and removing information that could be used to identify a patient, before the electronic medical record data are processed, is therefore an active research topic. Few de-identification methods exist for Chinese electronic medical records, and their cross-center performance is poor. We therefore developed a de-identification method that combines rule-based and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. Five-fold cross-validation was used to evaluate C5.0, Random Forest, SVM, and XGBoost, and leave-one-out testing was used to evaluate CRF. The F1 measure of the machine learning methods reached 91.18% for PHI_Names, 98.21% for PHI_MEDICALID, 95.74% for PHI_OTHERNFC, 97.14% for PHI_GEO, 89.19% for PHI_DATES, and 91.49% for PHI_TEL. The F1 measure of the rule-based methods reached 93.00% for PHI_Names, 97.00% for PHI_MEDICALID, 97.00% for PHI_OTHERNFC, 97.00% for PHI_GEO, 96.00% for PHI_DATES, and 89.00% for PHI_TEL.
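To make the rule-based side and the F1 measure concrete, here is a minimal Python sketch. It is not the authors' implementation: the two regular expressions, the function names `find_phi`, `f1_score`, and `redact`, and the exact-span matching criterion are all assumptions chosen for illustration. Only the category labels PHI_DATES and PHI_TEL come from the abstract.

```python
import re

# Hypothetical rules (assumptions, not the paper's rule set):
# a date written as YYYY年M月D日 or YYYY-M-D, and an 11-digit mobile number.
PATTERNS = {
    "PHI_DATES": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}"),
    "PHI_TEL": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def find_phi(text):
    """Return the set of (label, start, end) spans matched by the rules."""
    spans = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.add((label, match.start(), match.end()))
    return spans


def f1_score(predicted, gold):
    """Exact-match F1 between a predicted and a gold set of spans."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def redact(text, spans):
    """Replace each span with its label, right to left so offsets stay valid."""
    for label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


if __name__ == "__main__":
    note = "患者于2019年3月5日入院，电话13812345678。"
    found = find_phi(note)
    print(redact(note, found))
```

Calling `f1_score(found, gold)` with a hand-annotated `gold` set of spans gives a per-record score. Per-category figures like those reported in the abstract would need the spans filtered by label first, and the paper's own matching criterion may differ from the exact-span match assumed here.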
Pages: 105-111
Page count: 7