A fast algorithm for computing distance correlation

Cited by: 40
Authors
Chaudhuri, Arin [1]
Hu, Wenhao [1]
Affiliations
[1] SAS Inst Inc, Internet Things, 500 SAS Campus Dr, Cary, NC 27513 USA
Keywords
Distance correlation; Dependency measure; Fast algorithm; Merge sort; Dependence
DOI
10.1016/j.csda.2019.01.016
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Classical dependence measures such as Pearson correlation, Spearman's rho, and Kendall's tau can detect only monotonic or linear dependence. To overcome these limitations, Szekely et al. proposed distance covariance and its derived correlation. The distance covariance is a weighted L2 distance between the joint characteristic function and the product of the marginal characteristic functions; it is 0 if and only if two random vectors X and Y are independent. This measure can detect the presence of a dependence structure when the sample size is large enough. Szekely et al. further showed that the sample distance covariance can be calculated simply from modified Euclidean distances, which typically requires O(n^2) cost, where n is the sample size. Quadratic computing time greatly limits the use of the distance covariance for large data. To calculate the sample distance covariance between two univariate random variables, a simple, exact O(n log n) algorithm is developed. The proposed algorithm essentially consists of two sorting steps, so it is easy to implement. Empirical results show that the proposed algorithm is significantly faster than state-of-the-art methods. The algorithm's speed will enable researchers to explore complicated dependence structures in large datasets. © 2019 Elsevier B.V. All rights reserved.
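To make the quadratic baseline mentioned in the abstract concrete, the following is a minimal Python sketch of the sample distance covariance and distance correlation for two univariate samples, computed directly from double-centered pairwise distances. It assumes the standard V-statistic form of the estimator; the abstract does not say which estimator variant the paper uses, and the function names `dcov_sq` and `dcor` are illustrative only. This is the O(n^2) reference computation, not the O(n log n) algorithm the paper proposes.

```python
import math


def _double_centered(v):
    """Pairwise distances |v_i - v_j| with row, column, and grand means removed."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def dcov_sq(x, y):
    """Squared sample distance covariance (V-statistic form, assumed here)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    a = _double_centered(x)
    b = _double_centered(y)
    # O(n^2) time and memory: every pair (i, j) is visited once.
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def dcor(x, y):
    """Sample distance correlation; returns 0.0 if either sample is constant."""
    denom = math.sqrt(dcov_sq(x, x) * dcov_sq(y, y))
    if denom == 0.0:
        return 0.0
    return math.sqrt(max(dcov_sq(x, y), 0.0) / denom)
```

For example, with `x = [-2, -1, 0, 1, 2]` and `y = [v * v for v in x]`, the Pearson correlation is 0 while `dcor(x, y)` is strictly positive, which illustrates the kind of non-monotonic dependence the abstract refers to. The nested loops and stored distance matrices are what make this approach impractical for large n.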
Pages: 15-24
Page count: 10