Correcting for the effects of class imbalance improves the performance of machine-learning based species distribution models

Cited by: 18
Authors
Benkendorf, Donald J. [1, 2]
Schwartz, Samuel D. [3]
Cutler, D. Richard [4]
Hawkins, Charles P. [1, 2]
Affiliations
[1] Utah State Univ, Ecol Ctr, Dept Watershed Sci, 5210 Old Main Hill, Logan, UT 84322 USA
[2] Utah State Univ, Natl Aquat Monitoring Ctr, 5210 Old Main Hill, Logan, UT 84322 USA
[3] Univ Oregon, Dept Comp Sci, Eugene, OR 97403 USA
[4] Utah State Univ, Dept Math & Stat, Logan, UT 84322 USA
Funding
National Science Foundation (USA);
Keywords
Species distribution models; Class imbalance; Prevalence; Machine-learning; Aquatic macroinvertebrates; ARTIFICIAL NEURAL-NETWORKS; SUPPORT VECTOR MACHINES; MACROINVERTEBRATE FAUNA; BIOLOGICAL INTEGRITY; CLIMATE-CHANGE; CLASSIFICATION; PREVALENCE; PREDICTION; RESPONSES; BIODIVERSITY;
DOI
10.1016/j.ecolmodel.2023.110414
Chinese Library Classification
Q14 [Ecology (bioecology)];
Discipline classification codes
071012 ; 0713 ;
Abstract
Numerous methods have been developed to combat the unwanted effects of imbalanced training data on the performance of machine-learning based predictive models. These methods attempt to balance model sensitivity and specificity. However, the effects of specific imbalance-correction methods on the performance of different machine-learning algorithms are not well understood for ecological data. In this study, we used four machine-learning algorithms (random forest, artificial neural network, gradient boosting, support vector machine) and five imbalance-correction methods (base algorithm = no correction, cutoff, up-sampling, down-sampling, weighting) to produce species distribution models for 15 freshwater macroinvertebrate genera that varied from 2.5 to 29.0% in prevalence. All imbalance-correction methods substantially improved average model performance (true skill statistic) over the base machine-learning algorithms, except when up-sampling was applied to random forest models. Choice of machine-learning algorithm had little effect on model performance, although gradient boosting performed better than other algorithms on the most imbalanced datasets. Our results suggest that the performance of species distribution models built with presence/absence data can generally be improved by correcting for imbalanced data.
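The abstract names four correction methods (cutoff, up-sampling, down-sampling, weighting) and one performance measure (true skill statistic, TSS = sensitivity + specificity - 1). The following is a minimal Python sketch of what these terms usually mean, not the authors' implementation. The toy data, the function names, the use of prevalence as the cutoff, and the balanced inverse-frequency weights are all illustrative assumptions; the record does not state how the study chose cutoffs or weights.

```python
import random


def true_skill_statistic(y_true, y_pred):
    """TSS = sensitivity + specificity - 1, for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def classify(probs, cutoff=0.5):
    """Cutoff correction: turn predicted probabilities into 0/1 labels."""
    return [1 if p >= cutoff else 0 for p in probs]


def down_sample(labels, seed=0):
    """Row indices after randomly dropping absences to match presences.
    Assumes presences (1) are the minority class."""
    rng = random.Random(seed)
    pres = [i for i, y in enumerate(labels) if y == 1]
    absn = [i for i, y in enumerate(labels) if y == 0]
    return sorted(pres + rng.sample(absn, len(pres)))


def up_sample(labels, seed=0):
    """Row indices after resampling presences (with replacement) to match
    absences. Assumes presences (1) are the minority class."""
    rng = random.Random(seed)
    pres = [i for i, y in enumerate(labels) if y == 1]
    absn = [i for i, y in enumerate(labels) if y == 0]
    extra = rng.choices(pres, k=len(absn) - len(pres))
    return sorted(pres + absn + extra)


def class_weights(labels):
    """Weighting: inverse-frequency weights so both classes count equally."""
    n, n_pres = len(labels), sum(labels)
    return {1: n / (2 * n_pres), 0: n / (2 * (n - n_pres))}


# Hypothetical toy data: 2 presences in 10 sites (20% prevalence).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.45, 0.30, 0.35, 0.10, 0.05, 0.25, 0.15, 0.02, 0.08, 0.12]

prevalence = sum(y_true) / len(y_true)
tss_default = true_skill_statistic(y_true, classify(probs, 0.5))
tss_cutoff = true_skill_statistic(y_true, classify(probs, prevalence))

print(tss_default)  # 0.0: every site is predicted absent at cutoff 0.5
print(tss_cutoff)   # 0.75: sensitivity 1.0, specificity 0.75
print(class_weights(y_true))
```

On this toy example the default 0.5 cutoff predicts no presences at all, so sensitivity is zero and TSS is zero; lowering the cutoff trades some specificity for sensitivity. The resampling helpers return row indices, so they could be used to subset any training table before fitting a model.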
Pages: 14