Determining the number of clusters using the weighted gap statistic

Cited by: 108
Authors
Yan, Mingjin
Ye, Keying
Affiliations
[1] Medtron Sofamor Danek, Memphis, TN 38132 USA
[2] Univ Texas San Antonio, Dept Management Sci & Stat, San Antonio, TX 78249 USA
Keywords
cluster analysis; difference of difference-weighted (DD-weighted) gap statistic; gap statistic; multilayer clustering; weighted gap statistic
DOI
10.1111/j.1541-0420.2007.00784.x
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B 63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors, a measure of within-cluster homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting nested cluster structure in the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. The proposed methods are compared with one another and with the original gap method in simulation studies and on real data.
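As an illustration of the kind of procedure the abstract describes, the following is a minimal sketch of the original gap statistic of Tibshirani, Walther, and Hastie (2001) with an optional weighted dispersion. It is not the authors' implementation. Several things are assumptions made for this sketch: the weighted form (dividing each cluster's pairwise sum by 2 n_r (n_r - 1)) is not given in this record; `simple_kmeans` is a hypothetical stand-in for "any clustering method"; the reference data are drawn uniformly over the bounding box of the observations; and the DD-weighted gap and multilayer approach are not covered.

```python
import numpy as np


def dispersion(X, labels, weighted=False):
    """Pooled within-cluster dispersion W_k for one partition of X."""
    total = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        n_r = len(pts)
        if n_r < 2:
            continue  # convention in this sketch: a singleton contributes 0
        # D_r: sum of squared distances over all ordered pairs in cluster r,
        # which equals 2 * n_r * (sum of squared deviations from the mean).
        d_r = 2.0 * n_r * np.sum((pts - pts.mean(axis=0)) ** 2)
        # ASSUMPTION: the weighted form divides by 2 * n_r * (n_r - 1);
        # the unweighted form divides by 2 * n_r (Tibshirani et al., 2001).
        total += d_r / (2.0 * n_r * (n_r - 1) if weighted else 2.0 * n_r)
    return total


def simple_kmeans(X, k, n_iter=50):
    """Hypothetical helper: Lloyd's algorithm, farthest-first initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(k):
            if np.any(labels == r):  # keep the old centre if a cluster empties
                centers[r] = X[labels == r].mean(axis=0)
    return labels


def gap_statistic(X, k_max=5, n_ref=10, weighted=False, seed=0):
    """Return (k_hat, gaps, s) for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ks = np.arange(1, k_max + 1)
    log_w = np.array(
        [np.log(dispersion(X, simple_kmeans(X, k), weighted)) for k in ks]
    )
    # Reference log-dispersions from uniform samples over the bounding box.
    ref = np.empty((n_ref, k_max))
    for b in range(n_ref):
        Xb = rng.uniform(lo, hi, size=X.shape)
        for j, k in enumerate(ks):
            ref[b, j] = np.log(dispersion(Xb, simple_kmeans(Xb, k), weighted))
    gaps = ref.mean(axis=0) - log_w
    s = ref.std(axis=0) * np.sqrt(1.0 + 1.0 / n_ref)
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; fall back to k_max.
    ok = [j for j in range(k_max - 1) if gaps[j] >= gaps[j + 1] - s[j + 1]]
    k_hat = int(ks[ok[0]]) if ok else int(k_max)
    return k_hat, gaps, s
```

Calling `gap_statistic(X, weighted=True)` on an `(n, d)` array of continuous measurements switches to the assumed weighted dispersion. Any other clustering routine that returns integer labels could replace `simple_kmeans`.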
Pages: 1031-1037
Page count: 7
Related papers
9 in total
[1] [Anonymous], 1975, Clustering Algorithm
[2] Calinski T., 1974, Commun Stat, V3, P1, DOI 10.1080/03610927408827101
[3] Dudoit S, 2002, Genome Biol, V3
[4] Everitt BS, 2001, Ann Eugen, V7, P179
[5] Kaufman L., 2009, Finding groups in data: An introduction to cluster analysis
[6] Krzanowski, WJ; Lai, YT. A criterion for determining the number of groups in a data set using sum-of-squares clustering [J]. Biometrics, 1988, 44 (01): 23-34
[7] Sugar, CA; James, GM. Finding the number of clusters in a dataset: An information-theoretic approach [J]. Journal of the American Statistical Association, 2003, 98 (463): 750-763
[8] Tibshirani, R; Walther, G; Hastie, T. Estimating the number of clusters in a data set via the gap statistic [J]. Journal of the Royal Statistical Society Series B-Statistical Methodology, 2001, 63: 411-423
[9] Wolberg, WH; Mangasarian, OL. Multisurface method of pattern separation for medical diagnosis applied to breast cytology [J]. Proceedings of the National Academy of Sciences of the United States of America, 1990, 87 (23): 9193-9196