Modelling species presence-only data with random forests

Cited by: 110
Authors
Valavi, Roozbeh [1]
Elith, Jane [1]
Lahoz-Monfort, Jose J. [1]
Guillera-Arroita, Gurutzeta [1]
Affiliations
[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia
Keywords
class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; point process models; statistical comparisons; decision trees; classification; regression; bias; distributions; classifiers; performance
DOI
10.1111/ecog.05615
Chinese Library Classification number
X176 [Biodiversity conservation]
Subject classification code
090705
Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
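As an illustration of the sampling approaches the abstract refers to, the following is a minimal sketch of per-tree down-sampling: each tree would be trained on a bootstrap sample holding equal numbers of presence and background points, so the full background set is still used across the forest. The function name `balanced_bootstrap`, the label coding (1 = presence, 0 = background) and the sample sizes are assumptions made for this example; it is not the authors' implementation.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree, with equal counts per class.

    Assumes labels are coded 1 (presence) and 0 (background).
    """
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    # Down-sample the larger class to the size of the smaller one.
    n = min(len(presence), len(background))
    # Sample with replacement within each class, as in a bootstrap.
    sample = [rng.choice(presence) for _ in range(n)]
    sample += [rng.choice(background) for _ in range(n)]
    return sample


# Hypothetical data: 50 presences against 10,000 background points.
labels = [1] * 50 + [0] * 10000
rng = random.Random(42)

# One balanced sample per tree; different trees see different background points.
samples = [balanced_bootstrap(labels, rng) for _ in range(100)]
counts = Counter(labels[i] for i in samples[0])
```

Each entry of `samples` would then be passed to a tree learner of your choice. Because the background subset is redrawn for every tree, the class ratio seen by any single tree is balanced while the background sample as a whole keeps its environmental coverage.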
Pages: 1731-1742
Page count: 12