A methodology to design heuristics for model selection based on the characteristics of data: Application to investigate when the Negative Binomial Lindley (NB-L) is preferred over the Negative Binomial (NB)

Cited by: 20
Authors
Shirazi, Mohammadali [1]
Dhavala, Soma Sekhar [2]
Lord, Dominique [1]
Geedipally, Srinivas Reddy [3]
Affiliations
[1] Texas A&M Univ, Zachry Dept Civil Engn, College Stn, TX 77843 USA
[2] Perceptron Learning Solut Pvt Ltd, Bengaluru, India
[3] Texas A&M Transportat Inst, Arlington, TX 76013 USA
Keywords
Model Selection; Heuristics; Characteristics of Data; Machine Learning; Negative Binomial; Negative Binomial Lindley; GENERALIZED LINEAR-MODEL; STATISTICAL-ANALYSIS; CRASH; BAYES
DOI
10.1016/j.aap.2017.07.002
Chinese Library Classification number
TB18 [Ergonomics]
Discipline classification code
1201
Abstract
Safety analysts usually rely on post-modeling methods, such as Goodness-of-Fit statistics or the Likelihood Ratio Test, to choose between two or more competing distributions or models. Such metrics require every competing distribution to be fitted to the data before any comparison can be made. Given the steady introduction of new statistical distributions, choosing the best one with such post-modeling methods is not a trivial task, quite apart from the theoretical and numerical issues the analyst may face during the analysis. Furthermore, and most importantly, these measures and tests give no insight into why a specific distribution (or model) is preferred over another (Goodness-of-Logic). This paper addresses these issues by proposing a methodology to design heuristics for Model Selection based on the characteristics of the data, expressed as descriptive summary statistics, before the models are fitted. The methodology employs two analytic tools, (1) Monte-Carlo Simulations and (2) Machine Learning Classifiers, to design simple heuristics that predict the label of the 'most-likely-true' distribution for the data at hand. It was applied to investigate when the recently introduced Negative Binomial Lindley (NB-L) distribution is preferred over the Negative Binomial (NB) distribution. Heuristics were designed to select the 'most-likely-true' distribution between the two, given a set of prescribed summary statistics of the data. The proposed heuristics compared well against classical tests on several observed datasets. Not only are they easy to use and free of any post-modeling inputs, but they also give the analyst useful information about why the NB-L is preferred over the NB, or vice versa, when modeling data.
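
The two-step idea in the abstract (simulate labeled datasets, then learn a rule from their summary statistics) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the parameter ranges, the four summary statistics, the rescaling of the Lindley term to mean 1, the NB-L form in which the NB mean is multiplied by a Lindley-distributed term, and the use of a hand-written one-split decision stump in place of the paper's Machine Learning Classifiers. All function names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def lindley(rng, theta):
    """Lindley(theta) as a mixture of Gamma(1, 1/theta) and Gamma(2, 1/theta)."""
    shape = 1 if rng.random() < theta / (1 + theta) else 2
    return rng.gammavariate(shape, 1 / theta)


def simulate(rng, label, n=200):
    """Simulate one dataset; the parameter ranges are illustrative assumptions."""
    mu, phi = rng.uniform(0.5, 3.0), rng.uniform(0.5, 5.0)
    theta = rng.uniform(0.5, 2.0)
    lindley_mean = (theta + 2) / (theta * (theta + 1))
    counts = []
    for _ in range(n):
        # Assumed NB-L form: NB mean scaled by a Lindley term rescaled to mean 1.
        eps = lindley(rng, theta) / lindley_mean if label == "NB-L" else 1.0
        lam = rng.gammavariate(phi, eps * mu / phi)  # gamma-Poisson mixture = NB
        counts.append(poisson(rng, lam))
    return counts


def summarize(counts):
    """Descriptive summary statistics used as classifier features."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((y - mean) ** 2 for y in counts) / n
    sd = math.sqrt(var)
    skew = sum((y - mean) ** 3 for y in counts) / (n * sd ** 3) if sd > 0 else 0.0
    return {
        "mean": mean,
        "vmr": var / mean if mean > 0 else 0.0,  # variance-to-mean ratio
        "skewness": skew,
        "zero_share": counts.count(0) / n,
    }


def fit_stump(rows, labels):
    """Pick the single (feature, threshold) split with the best training accuracy."""
    best = None
    for name in rows[0]:
        for t in sorted({r[name] for r in rows}):
            hits = sum((r[name] > t) == (lab == "NB-L") for r, lab in zip(rows, labels))
            for above, correct in (("NB-L", hits), ("NB", len(rows) - hits)):
                if best is None or correct > best[0]:
                    best = (correct, name, t, above)
    correct, name, t, above = best
    return {"feature": name, "threshold": t, "above": above,
            "train_acc": correct / len(rows)}


def predict(stump, row):
    """Apply the learned heuristic to one set of summary statistics."""
    other = "NB" if stump["above"] == "NB-L" else "NB-L"
    return stump["above"] if row[stump["feature"]] > stump["threshold"] else other


rng = random.Random(0)
labels = [rng.choice(["NB", "NB-L"]) for _ in range(600)]
rows = [summarize(simulate(rng, lab)) for lab in labels]
stump = fit_stump(rows[:400], labels[:400])          # learn the heuristic
test_acc = sum(predict(stump, r) == lab
               for r, lab in zip(rows[400:], labels[400:])) / 200
print(stump, test_acc)
```

The learned stump reads as a rule of the form "if the chosen statistic exceeds the threshold, prefer one distribution, otherwise the other", which is the kind of pre-fitting heuristic the abstract describes. Which statistic and threshold come out depends entirely on the assumed simulation ranges.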
Pages: 186-194
Number of pages: 9