Automatic classification of diseases from free-text death certificates for real-time surveillance

Cited by: 36
Authors
Koopman, Bevan [1]
Karimi, Sarvnaz [1]
Nguyen, Anthony [1]
McGuire, Rhydwyn [2]
Muscatello, David [2]
Kemp, Madonna [1]
Truran, Donna [1]
Zhang, Ming [1]
Thackway, Sarah [2]
Affiliations
[1] CSIRO, Royal Brisbane & Womens Hosp, Australian E Hlth Res Ctr, Brisbane, Qld, Australia
[2] NSW Minist Hlth, Sydney, NSW, Australia
Source
BMC MEDICAL INFORMATICS AND DECISION MAKING | 2015 / Vol. 15
Keywords
Syndromic surveillance; Machine learning; Death certificates; PUBLIC-HEALTH SURVEILLANCE; NEW-SOUTH-WALES; SYNDROMIC SURVEILLANCE; IDENTIFICATION; PRESENTATIONS; INFLUENZA
DOI
10.1186/s12911-015-0174-2
Chinese Library Classification number
R-058
Subject classification number
Abstract
Background: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.
Methods: Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000-2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors.
Results: Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness.
Conclusions: The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths. In addition, the methods are generally applicable to other diseases of interest and to other sources of medical free-text besides death certificates.
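The keyword-matching approach and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' rule set: the keyword lists, the function names (`compile_rules`, `classify`, `precision_recall_f1`) and the example certificate texts are all assumptions made for illustration, and the F-measure is taken to be the balanced harmonic mean of precision and recall.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical keyword lists; the abstract does not give the actual rules.
KEYWORDS: Dict[str, List[str]] = {
    "diabetes": ["diabetes", "diabetic"],
    "influenza": ["influenza", "flu"],
    "pneumonia": ["pneumonia"],
    "hiv": ["hiv", "human immunodeficiency virus"],
}


def compile_rules(keywords: Dict[str, List[str]]) -> Dict[str, "re.Pattern[str]"]:
    """Build one case-insensitive, whole-word pattern per disease."""
    rules = {}
    for disease, terms in keywords.items():
        alternation = "|".join(re.escape(term) for term in terms)
        rules[disease] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return rules


RULES = compile_rules(KEYWORDS)


def classify(certificate_text: str) -> Set[str]:
    """Return the diseases whose keywords appear in the free text."""
    return {d for d, pattern in RULES.items() if pattern.search(certificate_text)}


def precision_recall_f1(
    predicted: List[bool], actual: List[bool]
) -> Tuple[float, float, float]:
    """Precision (PPV), recall (sensitivity) and balanced F-measure."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example texts, not taken from the study's data.
    print(classify("Influenza A with secondary pneumonia"))
    # Whole-word matching misses the variant "Bronchopneumonia".
    print(classify("Bronchopneumonia; type 2 diabetes mellitus"))
    print(precision_recall_f1([True, True, False, False],
                              [True, False, True, False]))
```

In this sketch the second example is labelled with diabetes only, because whole-word matching does not find "pneumonia" inside "Bronchopneumonia". This is one simple way in which word variations, which the error analysis reports as a source of errors, can defeat a keyword rule.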
Pages: 10