A machine learning based approach to identify protected health information in Chinese clinical text

Cited by: 16
Authors
Du, Liting [1]
Xia, Chenxi [1]
Deng, Zhaohua [1]
Lu, Gary [2]
Xia, Shuxu [1]
Ma, Jingdong [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Tongji Med Coll, Sch Med & Hlth Management, 13 Hangkong Rd, Wuhan 430030, Hubei, Peoples R China
[2] Dassault Syst, 175 Wyman St, Waltham, MA 02451 USA
Keywords
Protected health information; De-identification; Electronic health records; Conditional random fields; ELECTRONIC MEDICAL-RECORD; DE-IDENTIFICATION METHOD; OF-THE-ART; LANGUAGE; IMPACT
DOI
10.1016/j.ijmedinf.2018.05.010
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Background: With the increasing use of electronic health records (EHRs) worldwide, protecting private information in clinical text has drawn extensive attention from both healthcare providers and researchers. De-identification, the process of identifying and removing protected health information (PHI) from clinical text, has been central to the discourse on medical privacy since 2006. While de-identification is becoming the global norm for handling medical records, there is a paucity of studies on its application to Chinese clinical text. Without efficient and effective privacy protection algorithms in place, the use of indispensable clinical information would be restricted.
Objectives: We aimed to (i) describe the current process for handling PHI in China, (ii) propose a machine learning based approach to identify PHI in Chinese clinical text, and (iii) validate the effectiveness of the machine learning algorithm for de-identification of Chinese clinical text.
Methods: Based on 14,719 discharge summaries from regional health centers in Ya'an City, Sichuan province, China, we built a conditional random fields (CRF) model to identify PHI in clinical text, and then used regular expressions to improve the recognition results for the PHI categories with fewer samples.
Results: We constructed a Chinese clinical text corpus with PHI tags through substantial manual annotation; the descriptive statistics of the PHI showed its wide range and diverse categories. The evaluation yielded a high F-measure of 0.9878, indicating that our CRF-based model performed well at identifying PHI in Chinese clinical text.
Conclusion: The rapid adoption of EHRs in the health sector has created an urgent need for tools that can parse patient-specific information from Chinese clinical text. Our application of CRF algorithms for de-identification has shown the potential to meet this need by offering a highly accurate and flexible solution for analyzing Chinese clinical text.
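The two-stage idea in the Methods (a CRF tagger followed by regular expressions for sparse PHI categories) and the F-measure reported in the Results can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and does not come from the paper: the category names, the three patterns, the `merge_spans` rule (CRF spans win, non-overlapping regex spans are added), and the exact-match span scoring. The CRF output is mocked as a list of spans rather than produced by a trained model.

```python
import re

# Hypothetical patterns for PHI categories that tend to be rare in a corpus.
# The paper does not list its actual regular expressions.
PATTERNS = {
    "PHONE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),       # 11-digit mobile number
    "ID_NUMBER": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),  # 18-character ID number
    "DATE": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),        # e.g. 2017年3月5日
}


def regex_spans(text):
    """Return (start, end, label) spans found by the hypothetical patterns."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def merge_spans(crf_spans, rx_spans):
    """Keep every CRF span; add regex spans that overlap no CRF span."""
    merged = list(crf_spans)
    for start, end, label in rx_spans:
        if all(end <= c_start or start >= c_end for c_start, c_end, _ in crf_spans):
            merged.append((start, end, label))
    return sorted(merged)


def f_measure(gold, predicted):
    """Balanced F-measure over exact-match (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    text = "患者张三,电话13812345678,于2017年3月5日入院。"
    crf_output = [(2, 4, "NAME")]  # mocked CRF prediction, not a real model
    for start, end, label in merge_spans(crf_output, regex_spans(text)):
        print(label, text[start:end])
```

Giving CRF spans priority is only one possible merge policy; a system could equally let regex matches override the CRF for the sparse categories.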
Pages: 24-32
Page count: 9