Random Forests for Heteroscedastic Data

Cited by: 0
Authors
Bellamy, Hugo [1 ]
King, Ross D. [1 ]
Affiliations
[1] University of Cambridge, Cambridge, England
Source
DISCOVERY SCIENCE, DS 2024, PT II | 2025 / Vol. 15244
Keywords
Random forests; Noise; Regression
DOI
10.1007/978-3-031-78980-9_3
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Random forests are a popular machine learning technique that is effective across a range of scientific problems. We extend the standard algorithm to incorporate the uncertainty information that arises in heteroscedastic data: datasets where the amount of noise in the target value varies between datapoints. We consider datasets where the relative amount of measurement noise in different datapoints is known. This is not the standard scenario, but it does commonly occur in real data, as we illustrate on 10 drug design datasets. Utilising this uncertainty information can lead to significantly better predictive performance. We introduce three random forest variations for learning from heteroscedastic data: parametric bootstrapping, weighted random forests and variable output smearing. All three can improve model performance, demonstrating the adaptability of random forests to heteroscedastic data and thus expanding their applicability. Additionally, variations in the relative performance of the three methods across datasets provide insight into the mechanisms of random forests and the purpose of the different random elements within the model.
Pages: 34-49
Number of pages: 16
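
The abstract names three random forest variants but, being an abstract, gives no implementation detail. The sketch below is purely an illustration, under assumed design choices, of one way each idea could be realised with scikit-learn: inverse-variance sample weights for the weighted forest, targets redrawn from N(y_i, sigma_i^2) for the parametric bootstrap, and output smearing whose added noise is scaled by each point's known noise level. The weighting scheme, noise scales, smearing constant and hyperparameters are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy heteroscedastic data: each target has a known (relative) noise level sigma_i.
X = rng.uniform(-3.0, 3.0, size=(500, 5))
y_clean = np.sin(X[:, 0]) + 0.5 * X[:, 1]
sigma = rng.uniform(0.1, 1.0, size=500)      # assumed known per-point noise scale
y = y_clean + rng.normal(0.0, sigma)         # observed noisy targets

# 1) Weighted random forest: down-weight noisy points when fitting,
#    here with inverse-variance weights w_i = 1 / sigma_i^2 (an assumed scheme).
wrf = RandomForestRegressor(n_estimators=200, random_state=0)
wrf.fit(X, y, sample_weight=1.0 / sigma**2)

# 2) Parametric bootstrapping: each tree is trained on targets redrawn from
#    N(y_i, sigma_i^2); here this replaces the usual row bootstrap, so noisier
#    labels vary more from tree to tree.
pb_trees = [
    DecisionTreeRegressor(max_features="sqrt", random_state=t).fit(
        X, y + rng.normal(0.0, sigma)
    )
    for t in range(200)
]

# 3) Variable output smearing: Breiman-style output smearing, with the added
#    noise scaled by each point's known noise level (scale factor is illustrative).
smear = 0.5
os_trees = [
    DecisionTreeRegressor(max_features="sqrt", random_state=t).fit(
        X, y + rng.normal(0.0, smear * sigma)
    )
    for t in range(200)
]

def ensemble_predict(trees, X_new):
    """Average per-tree predictions, as in a standard random forest."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)

print(wrf.predict(X[:3]))
print(ensemble_predict(pb_trees, X[:3]))
print(ensemble_predict(os_trees, X[:3]))
```

In all three sketches only the per-tree training targets change; the ensemble prediction remains the usual average over trees, which keeps the variants drop-in replacements for a standard random forest.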