Classification of Cancer-related Death Certificates using Machine Learning

Cited by: 15
Authors
Butt, Luke [1]
Zuccon, Guido [1]
Nguyen, Anthony [1]
Bergheim, Anton [2]
Grayson, Narelle [2]
Affiliations
[1] Australian E Hlth Res Ctr, Brisbane, Qld, Australia
[2] Canc Inst NSW, Eveleigh, NSW, Australia
Source
AUSTRALASIAN MEDICAL JOURNAL | 2013 / Vol. 6 / Issue 05
Keywords
Death certificates; Cancer Registry; cancer monitoring and reporting; machine learning; natural language processing; SNOMED CT;
DOI
10.4066/AMJ.2013.1654
Chinese Library Classification number
R5 [Internal Medicine];
Discipline classification code
1002 ; 100201 ;
Abstract
Background: Cancer monitoring and prevention rely on timely notification of cancer cases. However, abstracting and classifying cancer from the free text of pathology reports and other relevant documents, such as death certificates, is complex and time-consuming.
Aims: This paper investigates approaches for automatically detecting notifiable cancer cases listed as the cause of death in free-text death certificates supplied to Cancer Registries.
Method: A number of machine learning classifiers were studied. Features were extracted using natural language processing techniques and the Medtex toolkit. The features comprised stemmed words, bi-grams, and concepts from the SNOMED CT medical terminology. The baseline was a keyword spotter using keywords extracted from the long descriptions of ICD-10 cancer-related codes.
Results: Death certificates with notifiable cancer listed as the cause of death can be effectively identified with the methods studied in this paper. A Support Vector Machine (SVM) classifier achieved the best performance, with an overall F-measure of 0.9866 when evaluated on a set of 5,000 free-text death certificates using the token stem feature set. The SNOMED CT concept plus token stem feature set reached the lowest variance (0.0032) and false negative rate (0.0297) while achieving an F-measure of 0.9864. The SVM classifier accounted for the first 18 of the top 40 evaluated runs and was the most robust classifier, with a variance of 0.001141, half the variance of the other classifiers.
Conclusion: The selection of features had the greatest influence on classifier performance, although the type of classifier employed also affected performance. In contrast, the feature weighting schema had a negligible effect on performance. Specifically, stemmed tokens, with or without SNOMED CT concepts, were the most effective features when combined with an SVM classifier.
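The keyword-spotter baseline and the evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list, the sample certificate texts, and the function names `tokenize`, `bigrams`, `keyword_spotter`, and `evaluate` are all hypothetical. The paper's keywords came from ICD-10 long descriptions, and its features came from the Medtex toolkit, neither of which is reproduced here.

```python
import re

# Hypothetical keyword list for illustration only. The paper derived its
# keywords from the long descriptions of ICD-10 cancer-related codes.
CANCER_KEYWORDS = {"carcinoma", "malignant", "neoplasm", "melanoma", "lymphoma"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(tokens):
    """Return adjacent token pairs, one simple form of bi-gram feature."""
    return list(zip(tokens, tokens[1:]))


def keyword_spotter(text, keywords=CANCER_KEYWORDS):
    """Flag a certificate as cancer-related if any token is a keyword."""
    return any(token in keywords for token in tokenize(text))


def evaluate(predictions, labels):
    """Compute precision, recall, F-measure, and false negative rate."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    false_negative_rate = fn / (fn + tp) if fn + tp else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_negative_rate": false_negative_rate,
    }


# Made-up certificate texts with made-up labels (True = notifiable cancer).
certificates = [
    ("Metastatic carcinoma of the lung", True),
    ("Acute myocardial infarction", False),
    ("Malignant melanoma, widespread", True),
    ("Pneumonia; chronic obstructive pulmonary disease", False),
    ("Glioblastoma multiforme", True),  # not covered by the keyword list
]

predictions = [keyword_spotter(text) for text, _ in certificates]
labels = [label for _, label in certificates]
metrics = evaluate(predictions, labels)
print(metrics)
```

The last sample shows why a fixed keyword list can miss cases: a cancer term outside the list yields a false negative, which is the error type the abstract reports through the false negative rate.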
Pages: 292-299
Number of pages: 8