Modelling species presence-only data with random forests

Cited by: 110

Authors:

Valavi, Roozbeh [1]

Elith, Jane [1]

Lahoz-Monfort, Jose J. [1]

Guillera-Arroita, Gurutzeta [1]

Affiliations:

[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia

Source:

ECOGRAPHY | 2021 / Volume 44 / Issue 12

Keywords:

class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; POINT PROCESS MODELS; STATISTICAL COMPARISONS; DECISION TREES; CLASSIFICATION; REGRESSION; BIAS; DISTRIBUTIONS; CLASSIFIERS; PERFORMANCE

DOI:

10.1111/ecog.05615

Chinese Library Classification number:

X176 [Biodiversity conservation]

Discipline classification code:

090705

Abstract:

The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
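The abstract names sampling approaches that manage class imbalance, and "down-sampling" appears among the keywords. As a rough illustration only, the sketch below shows one common form of the idea: each tree gets a bootstrap sample with equal numbers of presence and background rows, while the full background set stays available across the ensemble. The function name `balanced_bootstrap`, the toy class sizes and the tree count are hypothetical choices for this example, not details taken from the paper, and the sketch assumes presences are the minority class.

```python
import random


def balanced_bootstrap(labels, rng):
    """Draw row indices for one tree: equal presence and background counts.

    Hypothetical helper for illustration. Assumes label 1 = presence and
    label 0 = background, with presences being the smaller class.
    """
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = len(presence)  # per-class sample size is set by the minority class
    # Sample with replacement within each class, as in an ordinary bootstrap.
    sample = [rng.choice(presence) for _ in range(n)]
    sample += [rng.choice(background) for _ in range(n)]
    return sample


# Toy data: 50 presences against 10,000 background points (made-up sizes).
labels = [1] * 50 + [0] * 10000
rng = random.Random(42)

# One balanced sample per tree; 500 trees is an arbitrary choice here.
per_tree = [balanced_bootstrap(labels, rng) for _ in range(500)]

# Each tree sees a balanced sample, yet the ensemble as a whole still
# draws on a large share of the background set.
used_background = {i for s in per_tree for i in s if labels[i] == 0}
print(len(per_tree[0]), len(used_background))
```

Because the background rows are re-drawn for every tree, no single tree is swamped by background points, but the ensemble is not restricted to a small fixed subset of them. Each index list would then be passed to whatever tree-fitting routine is in use.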

Pages: 1731-1742

Page count: 12
