Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes

Cited by: 192
Authors
Austin, Peter C. [1, 2, 3]
Tu, Jack V. [1, 2, 4, 5]
Ho, Jennifer E. [6, 7, 8]
Levy, Daniel [6, 7, 9]
Lee, Douglas S. [1, 2, 5, 10]
Affiliations
[1] Inst Clin Evaluat Sci, Toronto, ON M4N 3M5, Canada
[2] Univ Toronto, Inst Hlth Management Policy & Evaluat, Toronto, ON, Canada
[3] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[4] Univ Toronto, Div Cardiol, Sunnybrook Schulich Heart Ctr, Toronto, ON, Canada
[5] Univ Toronto, Fac Med, Toronto, ON, Canada
[6] NHLBI, Framingham Heart Study, Framingham, MA USA
[7] NHLBI, Ctr Populat Studies, Bethesda, MD 20892 USA
[8] Boston Univ, Dept Med, Sect Cardiovasc Med, Boston, MA 02118 USA
[9] Boston Univ, Sch Med, Dept Med, Boston, MA 02118 USA
[10] Univ Toronto, Univ Hlth Network, Dept Med, Toronto, ON, Canada
Funding
Canadian Institutes of Health Research
Keywords
Boosting; Classification trees; Bagging; Random forests; Classification; Regression trees; Support vector machines; Regression methods; Prediction; Heart failure; Reduced ejection fraction; Logistic regression; Trees; Improvement
DOI
10.1016/j.jclinepi.2012.11.008
Chinese Library Classification number
R19 [Health care organization and services (health services administration)]
Abstract
Objective: Physicians classify patients into those with or without a specific disease. Furthermore, there is often interest in classifying patients according to disease etiology or subtype. Classification trees are frequently used to classify patients according to the presence or absence of a disease. However, classification trees can suffer from limited accuracy. In the data-mining and machine-learning literature, alternate classification schemes have been developed. These include bootstrap aggregation (bagging), boosting, random forests, and support vector machines.
Study Design and Setting: We compared the performance of these classification methods with that of conventional classification trees to classify patients with heart failure (HF) according to the following subtypes: HF with preserved ejection fraction (HFPEF) and HF with reduced ejection fraction. We also compared the ability of these methods to predict the probability of the presence of HFPEF with that of conventional logistic regression.
Results: We found that modern, flexible tree-based methods from the data-mining literature offer substantial improvement in prediction and classification of HF subtype compared with conventional classification and regression trees. However, conventional logistic regression had superior performance for predicting the probability of the presence of HFPEF compared with the methods proposed in the data-mining literature.
Conclusion: The use of tree-based methods offers superior performance over conventional classification and regression trees for predicting and classifying HF subtypes in a population-based sample of patients from Ontario, Canada. However, these methods do not offer substantial improvements over logistic regression for predicting the presence of HFPEF.
© 2013 Elsevier Inc. All rights reserved.
Pages: 398-407
Page count: 10