Machine Learning in Environmental Research: Common Pitfalls and Best Practices

Cited by: 258
Authors
Zhu, Jun-Jie [1, 2]
Yang, Meiqi [1, 2]
Ren, Zhiyong Jason [1, 2]
Affiliations
[1] Princeton Univ, Dept Civil & Environm Engn, Princeton, NJ 08544 USA
[2] Princeton Univ, Andlinger Ctr Energy & Environm, Princeton, NJ 08544 USA
Keywords
Machine learning; supervised learning; environmental research; data preprocessing; data leakage; hyperparameter optimization; model explainability; causality; ARTIFICIAL NEURAL-NETWORK; RANDOM FOREST; PM2.5; CONCENTRATIONS; SPATIOTEMPORAL PREDICTION; LANDSLIDE SUSCEPTIBILITY; FLOOD SUSCEPTIBILITY; NATIONAL SCALE; MODELS; WASTE; CHINA
DOI
10.1021/acs.est.3c00026
Chinese Library Classification
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, due to the lack of familiarity and methodological rigor, inadequate ML studies may lead to spurious conclusions. In this study, we synthesized literature analysis with our own experience and provided a tutorial-like compilation of common pitfalls along with best practice guidelines for environmental ML research. We identified more than 30 key items and provided evidence-based data analysis based on 148 highly cited research articles to exhibit the misconceptions of terminologies, proper sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples on supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards for more accurate, robust, and practicable model uses in environmental research and applications.
Pages: 17671-17689
Page count: 19