Modelling species presence-only data with random forests

Cited by: 110
Authors
Valavi, Roozbeh [1]
Elith, Jane [1]
Lahoz-Monfort, Jose J. [1]
Guillera-Arroita, Gurutzeta [1]
Affiliation
[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia
Keywords
class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; point process models; statistical comparisons; decision trees; classification; regression; bias; distributions; classifiers; performance
DOI
10.1111/ecog.05615
Chinese Library Classification
X176 [Biodiversity conservation]
Discipline classification code
090705
Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
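The down-sampling idea named in the keywords can be illustrated with a minimal sketch. It assumes that each tree is fitted to a bootstrap sample containing equal numbers of presence and background points, drawn with replacement, so that the full background sample is still used across the forest. The function name `balanced_bootstrap` and the toy sample sizes are hypothetical, and the sketch is written in Python for illustration rather than in the R implementations the abstract refers to.

```python
import random

def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal-sized draws, with replacement, from each class."""
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(presence), len(background))  # size of the smaller class
    sample = rng.choices(presence, k=n) + rng.choices(background, k=n)
    rng.shuffle(sample)
    return sample

# Toy data (hypothetical sizes): 20 presences and 1000 background points.
labels = [1] * 20 + [0] * 1000
rng = random.Random(42)

# One balanced bag per tree; a forest would fit each tree to its own bag.
bags = [balanced_bootstrap(labels, rng) for _ in range(5)]

# (number of presences, bag size) for each bag.
class_counts = [(sum(labels[i] for i in bag), len(bag)) for bag in bags]
```

Because every bag is balanced but drawn afresh, different trees see different background points, which is one way to manage imbalance without shrinking the background sample itself.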
Pages: 1731-1742
Number of pages: 12