A dynamic ensemble approach to robust classification in the presence of missing data

Cited by: 0
Authors
Bryan Conroy
Larry Eshelman
Cristhian Potes
Minnan Xu-Wilson
Affiliation
[1] Philips Research North America,
Source
Machine Learning | 2016 / Vol. 102
Keywords
Missing data; Ensemble methods; Hemodynamic instability
DOI
Not available
Abstract
Many real-world datasets suffer from missing or incomplete data. In the healthcare setting, for example, certain patient measurement parameters, such as vitals and/or lab values, may be missing due to insufficient monitoring. When present, however, these features could be highly discriminative in predicting aspects of patient state. Therefore, it is desirable to incorporate these sparsely measured features into a predictive model. Training predictive algorithms on such datasets is complicated by the missing data. Overcoming this problem is usually achieved by first estimating values for the missing data, which is referred to as data imputation. Without strong prior knowledge about the relationship between features, though, it is common to fill in missing values with their respective population mean or median. The accuracy of this approach is limited, however, and may simply inject noise into the data. We propose a two-stage machine learning algorithm that learns a dynamic classifier ensemble from an incomplete dataset without data imputation. The algorithm is very simple to implement and applicable across a wide range of problems. Our method first employs a variant of AdaBoost to learn a set of low-dimensional classifiers, each of which abstains from predicting if its dependent feature(s) are missing. Our novel contribution is the secondary dynamic ensemble learning stage in which the low-dimensional classifiers are combined using a dynamic weighting that depends on the pattern of measured features in the present input data. This allows the model to be resilient to missing data by adjusting the strength of certain classifiers to account for missing features. We apply our algorithm to early detection of hemodynamic instability in ICU patients. Providing an effective risk score of hemodynamic instability has the potential to give the clinician sufficient time to intervene, thereby reducing the chance of organ damage due to insufficient blood perfusion. We compare the results of our algorithm to other common missing data approaches, including mean imputation and multiple imputation methods, and discuss the advantages of the approach given the constraints of the application domain (e.g., high specificity to combat hospital alarm fatigue).
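The two ideas named in the abstract, classifiers that abstain when their feature is missing and ensemble weights that depend on the pattern of measured features, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the functional form of the dynamic weighting, so the sketch assumes, purely for illustration, weights that are linear in a 0/1 missingness indicator. The names `stump_predict`, `dynamic_ensemble_score`, `base_w`, and `delta_w` are hypothetical, and the weights are set by hand rather than learned.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Abstaining decision stump on one feature.

    Returns +polarity or -polarity where the feature is measured,
    and 0 (abstain) where it is NaN.
    """
    col = X[:, feature]
    measured = ~np.isnan(col)
    out = np.zeros(len(col))
    out[measured] = np.where(col[measured] > threshold, polarity, -polarity)
    return out

def dynamic_ensemble_score(X, stumps, base_w, delta_w):
    """Weighted vote whose weights depend on which features are missing.

    ASSUMPTION (not from the paper): classifier i gets weight
    w_i(m) = base_w[i] + delta_w[i] @ m, where m is the 0/1
    missingness indicator of the input row.
    """
    M = np.isnan(X).astype(float)                                # (n, d)
    H = np.column_stack([stump_predict(X, *s) for s in stumps])  # (n, k)
    W = base_w + M @ delta_w.T                                   # (n, k)
    return np.sum(W * H, axis=1)

# Two features; the second is missing in the first row.
X = np.array([[1.0, np.nan],
              [1.0, 2.0]])
stumps = [(0, 0.5, 1.0), (1, 1.0, 1.0)]   # (feature, threshold, polarity)
base_w = np.array([1.0, 2.0])
# Hand-picked: strengthen classifier 0 by 0.5 when feature 1 is missing.
delta_w = np.array([[0.0, 0.5],
                    [0.0, 0.0]])

scores = dynamic_ensemble_score(X, stumps, base_w, delta_w)
print(scores)  # [1.5 3. ]
```

In the first row the stump on the missing feature abstains and the remaining stump is up-weighted, so no imputed value ever enters the score. In the paper both the stumps and the weighting are learned from data; here they are fixed only to show the mechanics.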
Pages: 443-463
Page count: 20