Effects of random forest modeling decisions on biogeochemical time series predictions

Cited by: 12
Authors
Regier, Peter [1]
Duggan, Matthew [1,2,3]
Myers-Pigg, Allison [1,4,5]
Ward, Nicholas [1,6]
Affiliations
[1] Pacific Northwest Natl Lab, Marine & Coastal Res Lab, Sequim, WA 98382 USA
[2] Cornell Univ, Cornell Lab Ornithol, K Lisa Yang Ctr Conservat Bioacoust, Ithaca, NY USA
[3] Cornell Univ, Dept Nat Resources & Environm, Ithaca, NY USA
[4] Pacific Northwest Natl Lab, Biol Sci Div, Richland, WA 99352 USA
[5] Univ Toledo, Dept Environm Sci, 2801 W Bancroft St, Toledo, OH 43606 USA
[6] Univ Washington, Sch Oceanog, Seattle, WA 98195 USA
Keywords
HIGH-FREQUENCY WAVE; LIMITATION; SELECTION; DYNAMICS; SENSORS; NITRATE
DOI
10.1002/lom3.10523
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Random forests (RF) are an increasingly popular machine learning approach used to model biogeochemical processes in the Earth system. While RF models are robust to many assumptions that complicate deterministic models, several important parameterization decisions affect appropriate use and optimal model fit. We explored the role that parameter decisions, including training/testing data splitting strategies, variable selection, and hyperparameters, play in RF goodness-of-fit by constructing models using 1296 unique parameter combinations to predict concentrations of nitrate, a key nutrient for biogeochemical cycling in aquatic ecosystems. Models were built on long-term, publicly available water quality and meteorology time series collected by the National Estuarine Research Reserve monitoring network for two contrasting ecosystems representing freshwater and brackish estuaries. We found that accounting for temporal dependence when splitting data into training and testing subsets was key for avoiding overestimation of model predictive power. In addition, variable selection, the ratio of training to testing data, and, to a lesser degree, variables per split and number of trees were significant parameters for optimizing RF goodness-of-fit. We also explored how model parameter decisions influenced interpretation of the relative importance of predictors to the model and of the relationships between predictors and the dependent variable, with results suggesting that both data structure and model parameterization influence these factors. Because much of the current RF literature is written for the computational and statistical science communities, the primary goal of this study is to provide guidelines for aquatic scientists new to machine learning to apply RF techniques appropriately to aquatic biogeochemical datasets.
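
The point about temporal dependence in train/test splitting can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the series length of 1000, and the 80/20 ratio are assumptions chosen for illustration. It compares a shuffled split with a chronological split and counts how often a test observation has an immediate neighbor in time in the training set, a rough proxy for the leakage that can inflate apparent predictive power in autocorrelated series.

```python
import random


def random_split(n, train_frac, seed=0):
    """Shuffle indices, ignoring time order (prone to temporal leakage)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return sorted(idx[:cut]), sorted(idx[cut:])


def chronological_split(n, train_frac):
    """Train on the earliest observations, test on the latest ones."""
    cut = int(n * train_frac)
    return list(range(cut)), list(range(cut, n))


def adjacent_leak_fraction(train, test):
    """Fraction of test points with an immediate time neighbor in training."""
    train_set = set(train)
    leaked = sum(1 for i in test if (i - 1) in train_set or (i + 1) in train_set)
    return leaked / len(test)


# Hypothetical series of 1000 evenly spaced observations, 80% used for training.
n = 1000
rand_train, rand_test = random_split(n, 0.8)
chron_train, chron_test = chronological_split(n, 0.8)

print("random split leak fraction:", adjacent_leak_fraction(rand_train, rand_test))
print("chronological split leak fraction:", adjacent_leak_fraction(chron_train, chron_test))
```

With a shuffled split, most test points sit directly beside a training point in time, whereas the chronological split leaves only the single boundary observation adjacent to the training data.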
Pages: 40-52
Page count: 13