On the overestimation of random forest's out-of-bag error

Cited by: 153
Authors
Janitza, Silke [1]
Hornung, Roman [1]
Affiliations
[1] Univ Munich, Inst Med Informat Proc Biometry & Epidemiol, Munich, Germany
Source
PLOS ONE | 2018 / Vol. 13 / Issue 08
Keywords
PREDICTION; CLASSIFICATION; TUMOR; DISCOVERY; PATTERNS; IMPACT; CANCER
DOI
10.1371/journal.pone.0201904
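Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract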
The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choices of random forest parameters. Based on simulated and real data this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
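The two sampling schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `stratified_subsample` and `balanced_subsample`, the default sampling fraction of 0.632, and the toy labels are assumptions made for illustration only. The sketch covers just the index-drawing step for a single tree; the out-of-bag indices it returns are the observations that tree would be evaluated on.

```python
import random
from collections import defaultdict


def _indices_by_class(labels):
    """Group observation indices by their class label."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def stratified_subsample(labels, fraction=0.632, rng=None):
    """Draw in-bag indices without replacement, class by class.

    Each class contributes round(fraction * class_size) observations, so the
    in-bag class proportions roughly match those of the full data set.
    The default fraction is an illustrative assumption, not a value taken
    from the abstract.
    Returns (in_bag, out_of_bag) as sorted index lists.
    """
    rng = rng or random.Random()
    in_bag = []
    for indices in _indices_by_class(labels).values():
        n_draw = max(1, round(fraction * len(indices)))
        in_bag.extend(rng.sample(indices, n_draw))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return sorted(in_bag), out_of_bag


def balanced_subsample(labels, n_per_class, rng=None):
    """Draw the same number of observations from every class.

    n_per_class must not exceed the size of the smallest class; otherwise
    random.sample raises ValueError.
    """
    rng = rng or random.Random()
    in_bag = []
    for indices in _indices_by_class(labels).values():
        in_bag.extend(rng.sample(indices, n_per_class))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return sorted(in_bag), out_of_bag


# Toy unbalanced labels (hypothetical): 60 observations of class 0, 40 of class 1.
toy_labels = [0] * 60 + [1] * 40
in_bag, out_of_bag = stratified_subsample(toy_labels, rng=random.Random(0))
```

In a full forest, one such draw would be made per tree, and each observation's out-of-bag prediction would aggregate only the trees for which it was out of bag.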
Pages: 31