Clustering massive datasets with applications in software metrics and tomography

Cited by: 26
Authors
Maitra, R [1]
Affiliations
[1] Univ Maryland Baltimore Cty, Dept Math & Stat, Baltimore, MD 21250 USA
Funding
National Science Foundation (USA); Andrew W. Mellon Foundation (USA)
Keywords
Gaussian distribution; likelihood ratio test; multistage procedure; sample;
DOI
10.1198/004017001316975925
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Clustering datasets is not an easy problem in general, and the difficulty is compounded for massive datasets. This article develops, under Gaussian assumptions, a multistage algorithm that clusters an initial sample, filters out observations that can be reasonably classified by these clusters, and iterates the preceding procedure on the remainder. A final step uses the estimated class probabilities and dispersions to classify each observation in the dataset. Results on test experiments indicate good performance. Application to datasets from software metrics and positron emission tomography required no more than five stages each, suggesting that the procedure is practical to implement.
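The sample, filter, iterate, and classify loop described in the abstract can be illustrated with a minimal Python sketch. This is not the article's algorithm: plain k-means stands in for the Gaussian clustering of each sample, a distance threshold stands in for the likelihood ratio test, and every stage is assumed to share one spherical variance. The function name `multistage_cluster` and all of its parameters (`k`, `sample_size`, `radius`, `min_points`, `max_stages`) are hypothetical choices made for illustration.

```python
import numpy as np


def _sqdist(x, centers):
    """Squared Euclidean distance from every row of x to every center."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def _kmeans(x, k, rng, iters=50):
    """Plain Lloyd's k-means (assumed stand-in for the sample-clustering step)."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = _sqdist(x, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers


def multistage_cluster(x, k=2, sample_size=200, radius=3.0,
                       min_points=20, max_stages=5, seed=0):
    """Illustrative multistage clustering; returns (labels, means, stages)."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    if n < max(min_points, k):
        raise ValueError("need at least max(min_points, k) observations")
    rng = np.random.default_rng(seed)
    remaining = np.arange(n)
    means, variances, weights = [], [], []
    stages = 0
    while stages < max_stages and len(remaining) >= max(min_points, k):
        # 1. Cluster a random sample of the still-unclassified observations.
        size = min(sample_size, len(remaining))
        sample = x[rng.choice(remaining, size=size, replace=False)]
        centers = _kmeans(sample, k, rng)
        # Pooled per-coordinate variance of the sample around its centers.
        var = _sqdist(sample, centers).min(axis=1).sum() / (len(sample) * p)
        var = max(var, 1e-12)
        # 2. Filter out observations that lie close to one of the centers
        #    (assumed threshold: radius**2 times the mean squared distance).
        d = _sqdist(x[remaining], centers)
        nearest = d.argmin(axis=1)
        keep = d.min(axis=1) <= radius ** 2 * var * p
        counts = np.bincount(nearest[keep], minlength=k)
        for j in range(k):
            if counts[j] > 0:
                means.append(centers[j])
                variances.append(var)
                weights.append(counts[j] / n)
        stages += 1
        if not keep.any():
            break
        # 3. Repeat the procedure on the remainder.
        remaining = remaining[~keep]
    if not means:
        raise ValueError("no observations were filtered; increase radius")
    means = np.vstack(means)
    variances = np.array(variances)
    weights = np.array(weights)
    # 4. Final step: classify every observation with the estimated class
    #    proportions and dispersions (spherical Gaussian log-density score).
    score = (np.log(weights) - 0.5 * p * np.log(variances)
             - 0.5 * _sqdist(x, means) / variances)
    return score.argmax(axis=1), means, stages
```

Calling `multistage_cluster(data, k=2)` on an `(n, p)` array returns one label per row, the stacked component means, and the number of stages actually run. A faithful implementation would replace the k-means step and the distance threshold with the Gaussian model fitting and likelihood ratio test that the article develops.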
Pages: 336-346
Page count: 11