Modelling species presence-only data with random forests

Cited by: 110
Authors
Valavi, Roozbeh [1]
Elith, Jane [1]
Lahoz-Monfort, Jose J. [1]
Guillera-Arroita, Gurutzeta [1]
Affiliations
[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia
Keywords
class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; POINT PROCESS MODELS; STATISTICAL COMPARISONS; DECISION TREES; CLASSIFICATION; REGRESSION; BIAS; DISTRIBUTIONS; CLASSIFIERS; PERFORMANCE;
DOI
10.1111/ecog.05615
Chinese Library Classification
X176 [Biodiversity conservation];
Discipline classification code
090705;
Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
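The keywords name down-sampling as one way to manage class imbalance. The sketch below illustrates one common form of it, assuming each tree is trained on a bootstrap of the presences plus an equally sized bootstrap of the background points. The function name `balanced_bootstrap`, the class sizes and the number of trees are hypothetical choices for illustration, and this is not presented as the authors' exact procedure.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: a bootstrap of the presences plus an
    equally sized bootstrap of the background points (down-sampling)."""
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = len(presence)
    # Sample with replacement within each class, so both classes contribute n rows.
    return rng.choices(presence, k=n) + rng.choices(background, k=n)


# Toy data: 50 presences against 10,000 background points (hypothetical sizes).
labels = [1] * 50 + [0] * 10_000
rng = random.Random(42)

# One balanced sample per tree; 500 trees is an assumed forest size.
per_tree = [balanced_bootstrap(labels, rng) for _ in range(500)]

# Class counts seen by the first tree, and background coverage across all trees.
first = Counter(labels[i] for i in per_tree[0])
used_background = {i for s in per_tree for i in s if labels[i] == 0}
print(first[1], first[0])
print(len(used_background))
```

Each tree sees a balanced sample, yet different trees draw different background points, so the forest as a whole still uses most of the background sample. This fits the abstract's aim of handling imbalance without compromising the environmental representativeness of the background.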
Pages: 1731 - 1742
Number of pages: 12
Related papers
50 in total
  • [21] Assessing the risks and opportunities of presence-only data for conservation planning
    Hermoso, Virgilio
    Kennard, Mark J.
    Linke, Simon
    JOURNAL OF BIOGEOGRAPHY, 2015, 42 (02) : 218 - 228
  • [22] Improving the estimation of the Boyce index using statistical smoothing methods for evaluating species distribution models with presence-only data
    Liu, Canran
    Newell, Graeme
    White, Matt
    Machunter, Josephine
    ECOGRAPHY, 2025, 2025 (01)
  • [23] Accounting for imperfect detection and survey bias in statistical analysis of presence-only data
    Dorazio, Robert M.
    GLOBAL ECOLOGY AND BIOGEOGRAPHY, 2014, 23 (12) : 1472 - 1484
  • [24] A new model to assess the probability of occurrence of a species, based on presence-only data
    Beaugrand, G.
    Lenoir, S.
    Ibanez, F.
    Mante, C.
    MARINE ECOLOGY PROGRESS SERIES, 2011, 424 : 175 - 190
  • [25] A taxonomic-based joint species distribution model for presence-only data
    Escamilla Molgora, Juan M.
    Sedda, Luigi
    Diggle, Peter J.
    Atkinson, Peter M.
    JOURNAL OF THE ROYAL SOCIETY INTERFACE, 2022, 19 (187)
  • [26] Ground validation of presence-only modelling with rare species: a case study on barbastelles Barbastella barbastellus (Chiroptera: Vespertilionidae)
    Rebelo, Hugo
    Jones, Gareth
    JOURNAL OF APPLIED ECOLOGY, 2010, 47 (02) : 410 - 420
  • [27] On the existence of maximum likelihood estimates for presence-only data
    Hefley, Trevor J.
    Hooten, Mevin B.
    METHODS IN ECOLOGY AND EVOLUTION, 2015, 6 (06) : 648 - 655
  • [28] Predicting abundance with presence-only models
    Bradley, Bethany A.
    LANDSCAPE ECOLOGY, 2016, 31 (01) : 19 - 30
  • [29] Preferential sampling for presence/absence data and for fusion of presence/absence data with presence-only data
    Gelfand, Alan E.
    Shirota, Shinichiro
    ECOLOGICAL MONOGRAPHS, 2019, 89 (03)
  • [30] FINITE-SAMPLE EQUIVALENCE IN STATISTICAL MODELS FOR PRESENCE-ONLY DATA
    Fithian, William
    Hastie, Trevor
    ANNALS OF APPLIED STATISTICS, 2013, 7 (04) : 1917 - 1939