Random Forests for Heteroscedastic Data

Cited by: 0
Authors
Bellamy, Hugo [1 ]
King, Ross D. [1 ]
Affiliations
[1] University of Cambridge, Cambridge, England
Source
DISCOVERY SCIENCE, DS 2024, PT II | 2025 / Vol. 15244
Keywords
Random forests; Noise; Regression
DOI
10.1007/978-3-031-78980-9_3
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Random forests are a popular machine learning technique that is effective across a range of scientific problems. We extend the standard algorithm to incorporate the uncertainty information that arises in heteroscedastic data: datasets where the amount of noise in the target value varies between datapoints. We consider datasets where the relative amount of measurement noise in different datapoints is known. This is not the standard scenario, but it does commonly occur in real data, as we illustrate on 10 drug design datasets. Utilising this uncertainty information can lead to significantly better predictive performance. We introduce three random forest variations for learning from heteroscedastic data: parametric bootstrapping, weighted random forests and variable output smearing. All three can improve model performance, demonstrating the adaptability of random forests to heteroscedastic data and thus expanding their applicability. Additionally, variations in the relative performance of the three methods across datasets provide insight into the mechanisms of random forests and the purpose of the different random elements within the model.
Pages: 34-49
Number of pages: 16
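
The abstract names three random forest variants but, being an abstract, gives no implementation detail. The sketch below is purely an illustration, under assumed design choices, of one way each idea could be realised with scikit-learn: inverse-variance sample weights for the weighted forest, targets redrawn from N(y_i, sigma_i^2) for the parametric bootstrap, and output smearing whose added noise is scaled by each point's known noise level. The weighting scheme, noise scales, smearing constant and hyperparameters are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy heteroscedastic data: each target has a known (relative) noise level sigma_i.
X = rng.uniform(-3.0, 3.0, size=(500, 5))
y_clean = np.sin(X[:, 0]) + 0.5 * X[:, 1]
sigma = rng.uniform(0.1, 1.0, size=500)      # assumed known per-point noise scale
y = y_clean + rng.normal(0.0, sigma)         # observed noisy targets

# 1) Weighted random forest: down-weight noisy points when fitting,
#    here with inverse-variance weights w_i = 1 / sigma_i^2 (an assumed scheme).
wrf = RandomForestRegressor(n_estimators=200, random_state=0)
wrf.fit(X, y, sample_weight=1.0 / sigma**2)

# 2) Parametric bootstrapping: each tree is trained on targets redrawn from
#    N(y_i, sigma_i^2); here this replaces the usual row bootstrap, so noisier
#    labels vary more from tree to tree.
pb_trees = [
    DecisionTreeRegressor(max_features="sqrt", random_state=t).fit(
        X, y + rng.normal(0.0, sigma)
    )
    for t in range(200)
]

# 3) Variable output smearing: Breiman-style output smearing, with the added
#    noise scaled by each point's known noise level (scale factor is illustrative).
smear = 0.5
os_trees = [
    DecisionTreeRegressor(max_features="sqrt", random_state=t).fit(
        X, y + rng.normal(0.0, smear * sigma)
    )
    for t in range(200)
]

def ensemble_predict(trees, X_new):
    """Average per-tree predictions, as in a standard random forest."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)

print(wrf.predict(X[:3]))
print(ensemble_predict(pb_trees, X[:3]))
print(ensemble_predict(os_trees, X[:3]))
```

In all three sketches only the per-tree training targets change; the ensemble prediction remains the usual average over trees, which keeps the variants drop-in replacements for a standard random forest.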