Implications of non-stationarity on predictive modeling using EHRs

Cited by: 35
Authors
Jung, Kenneth [1]
Shah, Nigam H. [2]
Affiliations
[1] Stanford Univ, Program Biomed Informat, Stanford, CA 94305 USA
[2] Stanford Univ, Ctr Biomed Informat Res, Stanford, CA 94305 USA
Keywords
Data mining; Machine learning; Predictive model; Prognostic model; Wound healing; HEALTH; RISK;
DOI
10.1016/j.jbi.2015.10.006
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
The rapidly increasing volume of clinical information captured in Electronic Health Records (EHRs) has led to the application of increasingly sophisticated models for purposes such as disease subtype discovery and predictive modeling. However, increasing adoption of EHRs implies that in the near future, much of the data available for such purposes will be from a time period during which both the practice of medicine and the clinical use of EHRs are in flux due to historic changes in both technology and incentives. In this work, we explore the implications of this phenomenon, called non-stationarity, on predictive modeling. We focus on the problem of predicting delayed wound healing using data available in the EHR during the first week of care in outpatient wound care centers, using a large dataset covering over 150,000 individual wounds and 59,958 patients seen over a period of four years. We manipulate the degree of non-stationarity seen by the model development process by changing the way data is split into training and test sets. We demonstrate that non-stationarity can lead to quite different conclusions regarding the relative merits of different models with respect to predictive power and calibration of their posterior probabilities. Under the non-stationarity exhibited in this dataset, the performance advantage of complex methods such as stacking relative to the best simple classifier disappears. Ignoring non-stationarity can thus lead to sub-optimal model selection in this task. (C) 2015 Elsevier Inc. All rights reserved.
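The abstract says the degree of non-stationarity seen during model development is controlled by how the data is split into training and test sets, but this record does not give the exact procedure. The following is a minimal sketch of one plausible contrast, assuming a random split (which hides drift by mixing time periods) versus a temporal split (which exposes it by testing on later data). The record fields `wound_id` and `date`, the cutoff date, and both function names are hypothetical and are not taken from the paper.

```python
import random
from datetime import date


def random_split(records, test_fraction=0.25, seed=0):
    """Shuffle records without regard to time, then hold out a fraction.

    Training and test sets come from the same mix of time periods,
    so any drift over time is invisible to the evaluation.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records, cutoff):
    """Train on records dated before the cutoff, test on the rest.

    The test set is strictly later than the training set, so the
    evaluation reflects whatever drift occurred across the cutoff.
    """
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Hypothetical toy data: one record per wound, spread over four years.
records = [
    {"wound_id": i, "date": date(2010 + i % 4, 1 + i % 12, 1)}
    for i in range(40)
]

rand_train, rand_test = random_split(records)
time_train, time_test = temporal_split(records, cutoff=date(2013, 1, 1))
```

Evaluating the same candidate models under both splits and comparing their rankings is one way to check whether a model's apparent advantage survives when it is tested on later data.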
Pages: 168-174
Page count: 7