The consequences of checking for zero-inflation and overdispersion in the analysis of count data

Cited by: 36
Author
Campbell, Harlan [1]
Affiliation
[1] Univ British Columbia, Vancouver, BC, Canada
Source
METHODS IN ECOLOGY AND EVOLUTION | 2021 / Volume 12 / Issue 04
Keywords
model selection bias; overdispersion; zero-inflated models; inflation; POISSON REGRESSION-MODEL; LIKELIHOOD RATIO; SCORE TEST; SELECTION; INFERENCE; ECOLOGY; TESTS; ASSUMPTIONS
DOI
10.1111/2041-210X.13559
Chinese Library Classification number
Q14 [Ecology (Bioecology)]
Discipline classification code
071012; 0713
Abstract
Count data are ubiquitous in ecology and the Poisson generalized linear model (GLM) is commonly used to model the association between counts and explanatory variables of interest. When fitting this model to the data, one typically proceeds by first confirming that the model assumptions are satisfied. If the residuals appear to be overdispersed or if there is zero-inflation, key assumptions of the Poisson GLM may be violated and researchers will then typically consider alternatives to the Poisson GLM. An important question is whether the potential model selection bias introduced by this data-driven multi-stage procedure merits concern. Here we conduct a large-scale simulation study to investigate the potential consequences of model selection bias that can arise in the simple scenario of analysing a sample of potentially overdispersed, potentially zero-inflated, count data. Specifically, we investigate model selection procedures recently recommended by Blasco-Moreno et al. (2019) using either a series of score tests or information theoretic criteria to select the best model. We find that, when sample sizes are small, model selection based on preliminary score tests (or information theoretic criteria, e.g. AIC, BIC) can lead to potentially substantial inflation of false positive rates (i.e. type 1 error inflation). When sample sizes are sufficiently large, model selection based on preliminary score tests is not problematic. Ignoring the possibility of overdispersion and zero-inflation during data analyses can lead to invalid inference. However, if one does not have sufficient power to test for overdispersion and zero-inflation, post hoc model selection may also lead to substantial bias. This 'catch-22' suggests that, if sample sizes are small, a healthy skepticism is warranted whenever one rejects the null hypothesis of no association between a given outcome and covariate.
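As a rough illustration of the kind of preliminary checking the abstract refers to, the sketch below computes two simple descriptive diagnostics for a sample of counts with no covariates: the variance-to-mean ratio (about 1 under a Poisson model) and the number of zeros in excess of what a Poisson distribution with the sample mean would predict. These are not the score tests or information criteria studied in the paper; the function names `dispersion_index` and `excess_zeros` and the sample data are hypothetical and chosen only for illustration.

```python
import math


def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("all counts are zero; the index is undefined")
    variance = sum((y - mean) ** 2 for y in counts) / (n - 1)
    return variance / mean


def excess_zeros(counts):
    """Observed zeros minus the zeros expected under Poisson(sample mean)."""
    n = len(counts)
    mean = sum(counts) / n
    observed = sum(1 for y in counts if y == 0)
    expected = n * math.exp(-mean)
    return observed - expected


# Hypothetical sample (an assumption, not data from the paper):
# many zeros plus a few large counts.
sample = [0, 0, 0, 0, 0, 0, 1, 2, 7, 10]
print(dispersion_index(sample))  # well above 1 suggests overdispersion
print(excess_zeros(sample))      # well above 0 suggests extra zeros
```

Choosing the final model based on diagnostics like these, computed from the same data, is the data-driven multi-stage procedure whose consequences the abstract describes; with small samples the diagnostics themselves are noisy.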
Pages: 665-680
Number of pages: 16
Related papers
75 in total
[1] Albers, 2019, META PSYCHOL, V3, P1592
[2] Retire statistical significance [J]. Amrhein, Valentin; Greenland, Sander; McShane, Blake. NATURE, 2019, 567 (7748): 305-307
[3] Anderson DR., 2007, MODEL BASED INFERENC
[4] Compute Canada: Advancing Computational Research [J]. Baldwin, Susan. HIGH PERFORMANCE COMPUTING SYMPOSIUM 2011, 2012, 341
[5] Bening V.E., 2012, Generalized Poisson models and their applications in insurance and finance
[6] What does a zero mean? Understanding false, random and structural zeros in ecology [J]. Blasco-Moreno, Anabel; Perez-Casany, Marta; Puig, Pedro; Morante, Maria; Castells, Eva. METHODS IN ECOLOGY AND EVOLUTION, 2019, 10 (07): 949-959
[8] The relative performance of AIC, AICC and BIC in the presence of unobserved heterogeneity [J]. Brewer, Mark J.; Butler, Adam; Cooksley, Susan L. METHODS IN ECOLOGY AND EVOLUTION, 2016, 7 (06): 679-692
[9] Model selection: An integral part of inference [J]. Buckland, ST; Burnham, KP; Augustin, NH. BIOMETRICS, 1997, 53 (02): 603-618
[10] Multimodel inference - understanding AIC and BIC in model selection [J]. Burnham, KP; Anderson, DR. SOCIOLOGICAL METHODS & RESEARCH, 2004, 33 (02): 261-304