Statistical power for cluster analysis

Cited by: 286
Authors
Dalmaijer, Edwin S. [1]
Nord, Camilla L. [1]
Astle, Duncan E. [1]
Affiliations
[1] University of Cambridge, MRC Cognition and Brain Sciences Unit, 15 Chaucer Road, Cambridge CB2 7EF, England
Keywords
Statistical power; Dimensionality reduction; Cluster analysis; Latent class analysis; Latent profile analysis; Simulation; Sample size; Effect size; Covariance; VALIDATION; FAILURE
DOI
10.1186/s12859-022-04675-1
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).

Results: We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Delta = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Delta = 3).

Conclusions: Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
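The simulation idea summarised in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline. It assumes two spherical, unit-variance Gaussian subgroups, reads Delta as the Euclidean distance between their centroids, uses a hand-written two-cluster k-means, and counts a simulated dataset as a "detection" when its mean silhouette score reaches 0.5. The function names, the 0.5 threshold, and the default parameters are all illustrative assumptions and do not come from this record.

```python
import math
import random


def simulate_two_groups(n_per_group, delta, n_features, rng):
    """Two spherical, unit-variance Gaussian groups with centroids `delta` apart."""
    # Spread the separation evenly over features so that the Euclidean
    # distance between the two centroids equals `delta` (an assumption).
    shift = delta / math.sqrt(n_features)
    group_a = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
               for _ in range(n_per_group)]
    group_b = [[rng.gauss(shift, 1.0) for _ in range(n_features)]
               for _ in range(n_per_group)]
    return group_a + group_b


def two_means(points, rng, max_iter=100):
    """Plain Lloyd's k-means with k = 2; returns one 0/1 label per point."""
    centroids = [list(p) for p in rng.sample(points, 2)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette coefficient for a two-cluster labelling."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        if not own or not other:
            scores.append(0.0)  # singleton or missing cluster scores 0
            continue
        a = sum(own) / len(own)      # mean distance within own cluster
        b = sum(other) / len(other)  # mean distance to the other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def estimate_power(n_per_group, delta, n_features=2, n_sims=200,
                   threshold=0.5, seed=1):
    """Share of simulated datasets whose silhouette score reaches `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        points = simulate_two_groups(n_per_group, delta, n_features, rng)
        labels = two_means(points, rng)
        if mean_silhouette(points, labels) >= threshold:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical sweep over centroid separations at N = 20 per subgroup.
    for delta in (1.0, 2.0, 3.0, 4.0):
        print(f"delta={delta}: estimated power = {estimate_power(20, delta):.2f}")
```

Varying `n_per_group` at a fixed `delta` in this sketch is one way to explore the abstract's point that separation, rather than sample size beyond a sufficient minimum, governs detection. The numbers it produces depend on the assumed threshold and should not be read as the paper's results.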
Pages: 28