Incremental model-based clustering for large datasets with small clusters

被引:34
作者
Fraley, C
Raftery, A
Wehrens, R
机构
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[2] Dept Analyt Chem, NL-6525 ED Nijmegen, Netherlands
基金
美国国家卫生研究院;
关键词
BIC; EM algorithm; image; MRI;
D O I
10.1198/106186005X59603
中图分类号
O21 [概率论与数理统计]; C8 [统计学];
学科分类号
020208 ; 070103 ; 0714 ;
摘要
Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best Clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.
引用
收藏
页码:529 / 546
页数:18
相关论文
共 38 条
[1]   MODEL-BASED GAUSSIAN AND NON-GAUSSIAN CLUSTERING [J].
BANFIELD, JD ;
RAFTERY, AE .
BIOMETRICS, 1993, 49 (03) :803-821
[2]  
Bradley P. S., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P9
[3]   Linear flaw detection in woven textiles using model-based clustering [J].
Campbell, JG ;
Fraley, C ;
Murtagh, F ;
Raftery, AE .
PATTERN RECOGNITION LETTERS, 1997, 18 (14) :1539-1548
[4]  
Campbell JG, 1999, INT J IMAG SYST TECH, V10, P339, DOI 10.1002/(SICI)1098-1098(1999)10:4<339::AID-IMA5>3.0.CO
[5]  
2-3
[6]   GAUSSIAN PARSIMONIOUS CLUSTERING MODELS [J].
CELEUX, G ;
GOVAERT, G .
PATTERN RECOGNITION, 1995, 28 (05) :781-793
[7]   A component-wise EM algorithm for mixtures [J].
Celeux, G ;
Chrétien, S ;
Forbes, F ;
Mkhadri, A .
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2001, 10 (04) :697-712
[8]   Detecting features in spatial point processes with clutter via model-based clustering [J].
Dasgupta, A ;
Raftery, AE .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1998, 93 (441) :294-302
[9]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[10]  
DuMouchel W., 1999, Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, V15, P6