A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics

Cited by: 123
Authors
Baak, M. [1]
Koopman, R. [1]
Snoek, H. [2, 3]
Klous, S. [1, 4]
Affiliations
[1] KPMG Advisory NV, Laan Langerhuize 1, NL-1186 DS Amstelveen, Netherlands
[2] Nikhef Natl Inst Subat Phys, Sci Pk 105, NL-1098 XG Amsterdam, Netherlands
[3] Univ Amsterdam, Inst Phys, Sci Pk 904, NL-1098 XH Amsterdam, Netherlands
[4] Univ Amsterdam, Informat Inst, Sci Pk 904, NL-1098 XH Amsterdam, Netherlands
Keywords
Data analysis; Correlation; Contingency test; Significance; Simulation; Cross classifications; Association; Tests
DOI
10.1016/j.csda.2020.107043
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
A prescription is presented for a new and practical correlation coefficient, phi(K), based on several refinements to Pearson's hypothesis test of independence of two variables. The combined features of phi(K) form an advantage over existing coefficients. First, it works consistently between categorical, ordinal and interval variables, in essence by treating each variable as categorical, and can therefore be used to calculate correlations between variables of mixed type. Second, it captures nonlinear dependency. Third, the strength of phi(K) is similar to Pearson's correlation coefficient, and is equivalent in the case of a bivariate normal input distribution. These are useful properties when studying the correlations between variables with mixed types, where some are categorical. Two more innovations are presented: one for the proper evaluation of the statistical significance of correlations, and one for the interpretation of variable relationships in a contingency table, in particular in the case of sparse or low-statistics samples and significant dependencies. Two practical applications are discussed. The presented algorithms are easy to use and available through a public Python library. © 2020 Published by Elsevier B.V.
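The abstract states that the coefficient starts from Pearson's test of independence and treats every variable as categorical. The sketch below illustrates only that starting point. It builds a contingency table, computes Pearson's chi-square statistic, and normalises it to Cramér's V, which is a different, classical coefficient and not phi(K). The refinements that turn the test statistic into phi(K) are not reproduced here, and the function names are hypothetical. For actual phi(K) values, the public Python library mentioned in the abstract is the place to look.

```python
from collections import Counter
from math import sqrt


def contingency_table(xs, ys):
    """Count co-occurrences of the categories in two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    rows = sorted(set(xs))
    cols = sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(r, c)] for c in cols] for r in rows]


def pearson_chi2(table):
    """Pearson's chi-square statistic for a table of observed counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramér's V: chi-square scaled to [0, 1]. This is NOT phi(K)."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    if k == 0:
        return 0.0  # a variable with a single category carries no association
    return sqrt(pearson_chi2(table) / (n * k))


# Hypothetical toy data: two categorical variables that always move together.
colour = ["red", "blue", "red", "blue"]
shape = ["round", "square", "round", "square"]
print(cramers_v(contingency_table(colour, shape)))
```

Interval variables would first have to be binned into categories before a table like this can be built; the choice of binning is left open in this sketch.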
Pages: 25