Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data

Cited by: 80
Authors
Steen, Valerie A. [1, 2]
Tingley, Morgan W. [1, 3]
Paton, Peter W. C. [2]
Elphick, Chris S. [1]
Affiliations
[1] Univ Connecticut, Ecol & Evolutionary Biol, Storrs, CT 06269 USA
[2] Univ Rhode Isl, Dept Nat Resources Sci, Kingston, RI 02881 USA
[3] Univ Calif Los Angeles, Ecol & Evolutionary Biol, Los Angeles, CA USA
Source
METHODS IN ECOLOGY AND EVOLUTION | 2021 / Volume 12 / Issue 02
Funding
National Science Foundation (USA);
Keywords
class balancing; eBird; occurrence data; presence-absence data; prevalence; spatial thinning; SAMPLING BIAS; PRECISION-RECALL; PREVALENCE; PREDICTION; ACCURACY; IMPROVE; THRESHOLDS; VALIDATION; CURVES; REDUCE;
DOI
10.1111/2041-210X.13525
Chinese Library Classification number
Q14 [Ecology (Bioecology)];
Discipline classification code
071012; 0713;
Abstract
Spatial biases are a common feature of presence-absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non-detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique. To explore the consequences of spatial bias and class imbalance in presence-absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority-only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi-parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision-recall curve) and calibration (Brier score; Cohen's kappa) metrics. We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration. 
Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output (whether discrimination or calibration) should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis-a-vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.
Pages: 216-226
Number of pages: 11
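The three data-preparation strategies compared in the abstract (spatial thinning of all data, class balancing, and majority-only thinning) can be illustrated with a minimal sketch. This is not the authors' implementation: the square grid, the one-record-per-cell rule, the random undersampling used for balancing, the function names, and the toy coordinates are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Each record is (x, y, detected), where detected is 1 (presence) or 0 (absence).
# Toy data (hypothetical): 100 absences on a regular lattice plus 5 clustered presences.
records = [(i * 0.5, j * 0.5, 0) for i in range(10) for j in range(10)]
records += [(0.1 * k, 0.1 * k, 1) for k in range(1, 6)]


def thin_by_grid(recs, cell_size, rng):
    """Keep one randomly chosen record per square grid cell (assumed thinning rule)."""
    cells = defaultdict(list)
    for rec in recs:
        x, y, _ = rec
        cells[(int(x // cell_size), int(y // cell_size))].append(rec)
    # Sort by cell key so the result is reproducible for a given seed.
    return [rng.choice(group) for _, group in sorted(cells.items())]


def split_classes(recs):
    """Return (minority, majority) lists of records by class size."""
    presences = [r for r in recs if r[2] == 1]
    absences = [r for r in recs if r[2] == 0]
    minority, majority = sorted([presences, absences], key=len)
    return minority, majority


def spatial_thinning(recs, cell_size, seed=0):
    """Thin all records, ignoring class membership."""
    return thin_by_grid(recs, cell_size, random.Random(seed))


def majority_only_thinning(recs, cell_size, seed=0):
    """Retain every minority-class record; thin only the majority class."""
    minority, majority = split_classes(recs)
    return minority + thin_by_grid(majority, cell_size, random.Random(seed))


def class_balancing(recs, seed=0):
    """Undersample the majority class at random until both classes are equal in size."""
    minority, majority = split_classes(recs)
    return minority + random.Random(seed).sample(majority, len(minority))


def prevalence(recs):
    """Sample prevalence: the fraction of records that are detections."""
    return sum(r[2] for r in recs) / len(recs)


for name, subset in [
    ("all data", records),
    ("spatial thinning", spatial_thinning(records, 1.0)),
    ("majority-only thinning", majority_only_thinning(records, 1.0)),
    ("class balancing", class_balancing(records)),
]:
    print(f"{name}: n={len(subset)}, prevalence={prevalence(subset):.3f}")
```

In this toy case, thinning everything can discard most or all of the clustered minority-class records, while the other two strategies keep them all and raise sample prevalence. That shift in prevalence is the property the abstract links to poorer calibration.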