Diversity Forests: Using Split Sampling to Enable Innovative Complex Split Procedures in Random Forests

Cited by: 0
Author
Roman Hornung
Affiliation
[1] Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich
Keywords
Classification; Decision trees; Ensemble learning; Random forests
DOI
10.1007/s42979-021-00920-1
Abstract
The diversity forest algorithm is an alternative candidate-split sampling scheme that makes innovative complex split procedures in random forests possible. While conventional univariable, binary splitting suffices for strong predictive performance, new complex split procedures can help tackle practically important issues; for example, bivariable splitting can exploit interactions between features effectively. With diversity forests, each split is selected from a candidate split set that is sampled as follows: for l = 1, …, nsplits: (1) sample one split problem; (2) sample a single split, or a few splits, from the split problem sampled in (1) and add it or them to the candidate split set. The split problems are specifically structured collections of splits that depend on the respective split procedure considered. This sampling scheme makes innovative complex split procedures computationally tractable while avoiding overfitting. Important general properties of the diversity forest algorithm are evaluated empirically using univariable, binary splitting. Based on 220 data sets with binary outcomes, diversity forests are compared with conventional random forests and with random forests using extremely randomized trees. The results show that the split sampling scheme of diversity forests does not impair the predictive performance of random forests and that the performance is quite robust with regard to the specified nsplits value. The recently developed interaction forests are the first diversity forest method to use a complex split procedure; they allow interactions between features to be modeled and detected effectively. Further potential complex split procedures are discussed as an outlook. © The Author(s) 2021.
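The two-step sampling scheme in the abstract can be illustrated for the simplest case, univariable binary splitting, where a "split problem" is a single feature and a split is a (feature, threshold) pair. The sketch below is illustrative only: the function names and the uniform-threshold choice are assumptions, not taken from the author's R package, and the node-level details (stopping rules, tree construction) are omitted.

```python
import numpy as np

def sample_candidate_splits(X, y, nsplits, rng):
    """Sketch of the diversity forest candidate-split sampling for
    univariable binary splitting (names are illustrative assumptions).

    For l = 1, ..., nsplits:
      (1) sample one split problem -- here, a single feature;
      (2) sample one split from that problem -- here, a random
          threshold between two distinct observed values -- and
          add it to the candidate split set.
    """
    n, p = X.shape
    candidates = []
    for _ in range(nsplits):
        j = rng.integers(p)                # (1) split problem: one feature
        vals = np.unique(X[:, j])
        if len(vals) < 2:
            continue                       # feature is constant in this node
        lo, hi = sorted(rng.choice(vals, size=2, replace=False))
        t = rng.uniform(lo, hi)            # (2) one split: random threshold
        candidates.append((j, t))
    return candidates

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    pr = counts / counts.sum()
    return 1.0 - np.sum(pr ** 2)

def best_split(X, y, candidates):
    """Select the candidate split with the largest Gini impurity decrease."""
    parent = gini_impurity(y)
    best, best_gain = None, -np.inf
    for j, t in candidates:
        left = X[:, j] <= t
        if left.all() or not left.any():
            continue                       # split does not partition the node
        child = (left.mean() * gini_impurity(y[left])
                 + (~left).mean() * gini_impurity(y[~left]))
        gain = parent - child
        if gain > best_gain:
            best, best_gain = (j, t), gain
    return best
```

Note how the scheme bounds the work per node by nsplits regardless of how large the full split space is, which is what makes more complex split problems (e.g. feature pairs, as in interaction forests) computationally feasible.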