Missing data in medical databases: Impute, delete or classify?

Cited by: 118
Authors
Cismondi, Federico [1, 2, 3]
Fialho, Andre S. [1, 2, 3]
Vieira, Susana M. [2]
Reti, Shane R. [3]
Sousa, Joao M. C. [2]
Finkelstein, Stan N. [1]
Affiliations
[1] MIT, Engn Syst Div, Cambridge, MA 02139 USA
[2] Univ Tecn Lisboa, Inst Super Tecn, Dept Mech Engn, CIS IDMEC LAETA, P-1049001 Lisbon, Portugal
[3] Harvard Univ, Beth Israel Deaconess Med Ctr, Sch Med, Div Clin Informat,Dept Med, Boston, MA 02215 USA
Keywords
Missing data classification; Statistical classifier; Fuzzy systems; Test bed; Intensive care unit; SYSTEMS;
DOI
10.1016/j.artmed.2013.01.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Background: The multiplicity of information sources for data acquisition in modern intensive care units (ICUs) makes the resulting databases particularly susceptible to missing data. Missing data can significantly affect the performance of predictive risk modeling, an important technique for developing medical guidelines. The two most commonly used strategies for managing missing data are to impute or to delete values; the former can cause bias, while the latter can cause both bias and loss of statistical power. Objectives: In this paper we present a new approach for managing missing data in ICU databases in order to improve overall modeling performance. Methods: We use a statistical classifier followed by fuzzy modeling to determine more accurately which missing data should be imputed and which should not. We first develop a simulation test bed to evaluate performance, and then translate that knowledge using exactly the same database as the previously published work [13]. Results: In this work, test beds resulted in datasets with missing data ranging from 10% to 50%. Using this new approach to missing data, we are able to significantly improve modeling performance parameters such as classification accuracy by 11%, sensitivity by 13%, and specificity by 10%, along with an improvement in the area under the receiver operating characteristic curve (AUC) of up to 13%. Conclusions: In this work, we improve modeling performance in a simulated test bed, and then confirm the improved performance by replicating previously published work using the proposed approach for missing data classification. We offer this new method to other researchers who wish to improve predictive risk modeling performance in the ICU through advanced missing data management. © 2013 Elsevier B.V. All rights reserved.
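The abstract contrasts two baseline strategies, deleting records with missing values and imputing them, and reports results as sensitivity and specificity. The following is a minimal sketch of those baselines only, not the paper's statistical classifier or fuzzy modeling step. The function names, the use of `None` as the missing-value marker, and the toy records (heart rate, lactate) are assumptions made for illustration.

```python
from statistics import mean


def listwise_delete(rows):
    """Drop every record that has at least one missing (None) value."""
    return [row for row in rows if None not in row]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in the same column. Assumes every column has at least one observed
    value; otherwise statistics.mean raises StatisticsError."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy records: [heart rate, lactate]; None marks a missing value.
records = [[80.0, 1.0], [90.0, None], [None, 3.0], [100.0, 2.0]]
deleted = listwise_delete(records)  # keeps only the two complete records
imputed = mean_impute(records)      # keeps all four records, gaps filled

# Hypothetical labels and predictions, used only to exercise the metrics.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

On the toy data, deletion halves the sample, which illustrates the loss of statistical power the abstract mentions, while imputation keeps every record but pulls the filled values toward the column mean, one way in which bias can enter.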
Pages: 63-72
Page count: 10
Related papers
40 entries in total
[1]  
Acock A., 1997, FAMILY SCI REV, V1, P76
[2]   Visual methods for analyzing time-oriented data [J].
Aigner, Wolfgang ;
Miksch, Silvia ;
Muller, Wolfgang ;
Schumann, Heidrun ;
Tominski, Christian .
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2008, 14 (01) :47-60
[3]  
Allison PD, 2001, SAGE U PAPERS SERIES
[4]  
[Anonymous], 2009, USER GUIDE DOCUMENTA
[5]  
[Anonymous], COMPUTER METHODS PRO
[6]  
[Anonymous], 1998, Feature Extraction, Construction and Selection: A Data Mining Perspective
[7]   Applications of multiple imputation in medical studies: from AIDS to NHANES [J].
Barnard, J ;
Meng, XL .
STATISTICAL METHODS IN MEDICAL RESEARCH, 1999, 8 (01) :17-36
[8]   Critical care delivery in the intensive care unit: Defining clinical roles and the best practice model [J].
Brilli, RJ ;
Spevetz, A ;
Branson, RD ;
Campbell, GM ;
Cohen, H ;
Dasta, JF ;
Harvey, MA ;
Kelley, MA ;
Kelly, KM ;
Rudis, MI ;
St Andre, AC ;
Stone, JR ;
Teres, D ;
Weled, BJ .
CRITICAL CARE MEDICINE, 2001, 29 (10) :2007-2019
[9]  
Cios KJ, 2005, KNOWLEDGE DISCOVERY, P200
[10]  
Cismondi F., 2011, Proceedings 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2011), P224, DOI 10.1109/CIDM.2011.5949447