Mining data with random forests: current options for real-world applications

Times Cited: 195
Authors
Ziegler, Andreas [1 ,2 ]
Koenig, Inke R. [1 ]
Affiliations
[1] Med Univ Lubeck, Inst Med Biometrie & Stat, Univ Klinikum Schleswig Holstein, D-23538 Lubeck, Germany
[2] Med Univ Lubeck, Zentrum Klin Studien, D-23538 Lubeck, Germany
Keywords
VARIABLE IMPORTANCE MEASURES; CLASSIFICATION; REGRESSION; ALGORITHMS; SELECTION; LINKAGE; TREES; SNPS;
D O I
10.1002/widm.1114
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Random forests are fast, flexible, and robust for mining high-dimensional data. An extension of classification and regression trees (CART), they perform well even when the number of features is large and the number of observations is small. As with CART, random forests can handle continuous, categorical, and censored time-to-event outcomes. The tree-building process implicitly accommodates interactions between features and high correlation among features. Approaches are available for measuring variable importance and for reducing the number of features. Although random forests perform well in many applications, their theoretical properties are not fully understood. Several recent articles have improved this understanding, and we summarize their findings. We survey different versions of random forests, including random forests for classification, for probability estimation, and for survival data. We discuss the consequences of (1) no selection, (2) random selection, and (3) a combination of deterministic and random selection of features. We then review a backward elimination and a forward selection procedure, the determination of trees representing a forest, and the identification of important variables in a random forest. Finally, we provide a brief overview of application areas of random forests. (C) 2013 John Wiley & Sons, Ltd.
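The ideas summarized in the abstract (random feature selection at each split, variable importance measures, out-of-bag error) can be illustrated with a minimal sketch. This uses scikit-learn's `RandomForestClassifier` on synthetic data, not the authors' own software or data; all parameter choices here are illustrative assumptions.

```python
# Minimal sketch of random forest classification with variable importance,
# using scikit-learn on synthetic data (illustrative only, not the paper's setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical high-dimensional setting: many features, few observations.
X, y = make_classification(n_samples=100, n_features=50,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of trees in the forest
    max_features="sqrt",   # random selection of features at each split
    oob_score=True,        # out-of-bag estimate of prediction accuracy
    random_state=0,
)
forest.fit(X, y)

# Rank features by impurity-based (Gini) variable importance.
ranked = sorted(enumerate(forest.feature_importances_),
                key=lambda t: t[1], reverse=True)
print("OOB accuracy:", round(forest.oob_score_, 3))
print("Top 5 features by importance:", [i for i, _ in ranked[:5]])
```

Note that impurity-based importance can be biased (e.g., toward SNPs with large minor allele frequency, as discussed in the paper's related literature); permutation importance is a common alternative.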
Pages: 55-63
Page Count: 9