Clustering massive datasets with applications in software metrics and tomography

Cited by: 26
Authors
Maitra, R [1]
Affiliations
[1] Univ Maryland Baltimore Cty, Dept Math & Stat, Baltimore, MD 21250 USA
Funding
US National Science Foundation; Andrew W. Mellon Foundation
Keywords
Gaussian distribution; likelihood ratio test; multistage procedure; sample;
DOI
10.1198/004017001316975925
Chinese Library Classification
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Clustering datasets is not an easy problem in general, and the difficulty is compounded for massive datasets. This article develops, under Gaussian assumptions, a multistage algorithm that clusters an initial sample, filters out observations that can be reasonably classified by these clusters, and iterates the preceding procedure on the remainder. A final step uses the estimated class probabilities and dispersions to classify each observation in the dataset. Results on test experiments indicate good performance. Application to datasets from software metrics and positron emission tomography required no more than five stages each, suggesting that the procedure is practical to implement.
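The sample, filter, and iterate loop described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the article's procedure: the abstract does not give the details, so several pieces here are assumptions made for illustration. The function names (`kmeans_1d`, `multistage_cluster`, `classify`) and parameters (`k`, `sample_size`, `z_cut`, `min_remainder`) are hypothetical, plain k-means stands in for the Gaussian clustering of each sample, and a standardized-distance cutoff stands in for the likelihood ratio test named in the keywords.

```python
import math
import random
from statistics import mean, pvariance


def kmeans_1d(xs, k, iters=25):
    """Plain Lloyd iterations in 1-D; returns (mean, variance) per non-trivial group."""
    srt = sorted(xs)
    # Deterministic start: evenly spaced order statistics of the sample.
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    # Floor the variance so degenerate groups do not cause division by zero.
    return [(mean(g), max(pvariance(g), 1e-6)) for g in groups if len(g) > 1]


def multistage_cluster(data, k=2, sample_size=200, z_cut=3.0,
                       min_remainder=20, max_stages=5, seed=0):
    """Cluster a sample, filter well-fitted observations, repeat on the remainder.

    Returns a list of (mean, variance, weight). Weights are proportions among
    the filtered observations (assumption: a stand-in for class probabilities).
    """
    rng = random.Random(seed)
    remainder = list(data)
    clusters = []
    for _ in range(max_stages):
        if len(remainder) < max(min_remainder, 2 * k):
            break
        sample = rng.sample(remainder, min(sample_size, len(remainder)))
        found = kmeans_1d(sample, k)
        if not found:
            break
        counts = [0] * len(found)
        kept = []
        for x in remainder:
            # Standardized distance to every cluster found at this stage.
            z = [abs(x - m) / math.sqrt(v) for m, v in found]
            j = min(range(len(found)), key=lambda c: z[c])
            if z[j] <= z_cut:
                counts[j] += 1   # reasonably classified: filter it out
            else:
                kept.append(x)   # passed on to the next stage
        clusters += [(m, v, n) for (m, v), n in zip(found, counts) if n > 0]
        remainder = kept
    total = sum(n for _, _, n in clusters)
    return [(m, v, n / total) for m, v, n in clusters]


def classify(xs, model):
    """Final step: assign each observation to the component with the highest
    weighted Gaussian log-density."""
    def score(x, comp):
        m, v, w = comp
        return math.log(w) - 0.5 * math.log(v) - 0.5 * (x - m) ** 2 / v
    return [max(range(len(model)), key=lambda j: score(x, model[j])) for x in xs]
```

Calling `multistage_cluster(data)` on a list of floats returns the fitted components, and `classify(data, model)` then labels every observation, including any that were never filtered. The fixed `k` per stage is a simplification; a sample that contains more than `k` groups would be merged by this sketch.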
Pages: 336-346
Number of pages: 11