Analyzing the fine structure of distributions

Cited by: 42
Authors
Thrun, Michael C. [1, 2]
Gehlert, Tino [3]
Ultsch, Alfred [1]
Affiliations
[1] Philipps Univ Marburg, Dept Math & Comp Sci, Datab AG, Marburg, Germany
[2] Philipps Univ Marburg, Dept Hematol Oncol & Immunol, Marburg, Germany
[3] Tech Univ Chemnitz, Alumni Fac Math, Chemnitz, Germany
Keywords
BOX; BIMODALITY;
DOI
10.1371/journal.pone.0238835
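Chinese Library Classification (CLC) codes
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes
07; 0710; 09;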
Abstract
One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include the classical histogram as well as modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain in a default setting, conventional methods pose several problems when visualizing the PDF of uniform, multimodal and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of explorative distribution analysis. The results of the evaluation are presented using bimodal Gaussian distributions, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
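The idea of a mirrored density curve can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the density estimator behind the MD plot, so the sketch assumes a plain Gaussian kernel density estimate with Silverman's rule-of-thumb bandwidth, and the function names `gaussian_kde` and `mirrored_density` are hypothetical. It only shows how an estimated density can be reflected around a central axis so that a bimodal feature appears as two bulges.

```python
import math
import random
import statistics


def gaussian_kde(sample, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    n = len(sample)
    if bandwidth is None:
        # Silverman's rule of thumb: an assumption of this sketch,
        # not the estimator used by the MD plot itself.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
        for g in grid
    ]


def mirrored_density(sample, points=200):
    """Return grid values plus the density mirrored around zero."""
    lo, hi = min(sample), max(sample)
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    density = gaussian_kde(sample, grid)
    left = [-d for d in density]   # reflected half
    right = density                # original half
    return grid, left, right


# Synthetic bimodal Gaussian feature (illustrative data only).
random.seed(0)
sample = [random.gauss(-2, 0.5) for _ in range(300)]
sample += [random.gauss(2, 0.5) for _ in range(300)]

grid, left, right = mirrored_density(sample)
```

To draw the result, the region between `left` and `right` can be filled along the value axis, for example with matplotlib's `fill_betweenx(grid, left, right)`. With the fixed rule-of-thumb bandwidth used here, clipped or uniform data would be smoothed over, which is the kind of default-parameter problem the abstract describes.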
Pages: 20