A new variable importance measure for random forests with missing data

Cited by: 145
Authors
Hapfelmeier, Alexander [1]
Hothorn, Torsten [2]
Ulm, Kurt [1]
Strobl, Carolin [3]
Affiliations
[1] Tech Univ Munich, Inst Med Stat & Epidemiol, D-81675 Munich, Germany
[2] Univ Munich, Inst Stat, D-80539 Munich, Germany
[3] Univ Zurich, Dept Psychol, CH-8050 Zurich, Switzerland
Keywords
Variable importance measures; Permutation importance; Random forests; Missing values; Missing data; Classification trees; Selection; Inference; Bias
DOI
10.1007/s11222-012-9349-1
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Although alternatives such as complete case analysis and imputation exist, current methods for computing such measures cannot be applied straightforwardly when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and has good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. It takes the occurrence of missing values into account, which also makes its results differ from those obtained under multiple imputation.
Pages: 21-34
Page count: 14