Using decision trees to understand structure in missing data

Cited by: 38
Authors
Tierney, Nicholas J. [1, 2]
Harden, Fiona A. [3, 4]
Harden, Maurice J. [5]
Mengersen, Kerrie L. [1, 2]
Affiliations
[1] Queensland Univ Technol, Fac Sci & Engn, Dept Stat Sci, Math Sci, Brisbane, Qld 4001, Australia
[2] ARC Ctr Excellence Math & Stat Frontiers ACEMS, Brisbane, Qld, Australia
[3] Queensland Univ Technol, Fac Hlth, Clin Sci, Brisbane, Qld 4001, Australia
[4] Inst Hlth & Biomed Innovat, Brisbane, Qld, Australia
[5] Hunter Ind Med, Newcastle, NSW, Australia
Source
BMJ OPEN | 2015 / Vol. 5 / Issue 06
Funding
Australian Research Council
Keywords
PATTERN-MIXTURE MODELS; MULTIPLE IMPUTATION; REGRESSION TREES; EXPOSURE; PROGRAM; COHORT;
DOI
10.1136/bmjopen-2014-007450
Chinese Library Classification (CLC) number
R5 [Internal Medicine]
Subject classification code
1002; 100201
Abstract
Objectives: Demonstrate the application of decision trees, specifically classification and regression trees (CARTs) and their cousins, boosted regression trees (BRTs), to understand structure in missing data.
Setting: Data were taken from employees at 3 different industrial sites in Australia.
Participants: 7915 observations were included.
Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the 'rpart' and 'gbm' packages for CART and BRT analyses, respectively, from the statistical software 'R'. A simulation study was conducted to explore how well decision tree models describe data with missingness artificially introduced.
Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site at which the data were collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured than for structured missingness.
Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data and for selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers.
Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
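Illustration (not taken from the paper): the idea of a tree identifying the variable and value that induce missingness can be sketched with a single CART-style split fitted to a missingness indicator. The sketch below uses only the Python standard library, not the 'rpart' or 'gbm' packages named in the abstract. The toy data, the variable names `site`, `visits` and `exposure`, and the rule "exposure is missing when visits <= 3" are all assumptions made for this example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty or pure list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, is_missing, features):
    """Find the single split (feature, threshold) that minimises the
    size-weighted Gini impurity of the missingness indicator."""
    best = None
    n = len(rows)
    for f in features:
        values = sorted(set(r[f] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [m for r, m in zip(rows, is_missing) if r[f] <= t]
            right = [m for r, m in zip(rows, is_missing) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical toy data: 3 sites x 10 visit counts, one row per combination.
# Assumed missingness rule: 'exposure' is unrecorded when visits <= 3.
rows = []
for site in (1, 2, 3):
    for visits in range(1, 11):
        exposure = None if visits <= 3 else float(visits)
        rows.append({"site": site, "visits": visits, "exposure": exposure})

# Response for the tree: 1 if the value is missing, 0 otherwise.
is_missing = [1 if r["exposure"] is None else 0 for r in rows]

result = best_split(rows, is_missing, ["site", "visits"])
print(result)
```

On this toy grid the split recovered is on `visits` at 3.5 with zero impurity, which matches the rule used to generate the missingness. A full CART would apply the same search recursively to each child node, and real data would rarely separate this cleanly.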
Pages: 11