P-AutoClass: Scalable parallel clustering for mining large data sets

Cited by: 33
Authors
Pizzuti, C
Talia, D
Affiliations
[1] CNR, ICAR, Inst High Performance Comp & Networking, I-87036 Arcavacata Di Rende, CS, Italy
[2] Univ Calabria, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
data mining; parallel processing; knowledge discovery; data clustering; unsupervised classification; isoefficiency; scalability
DOI
10.1109/TKDE.2003.1198395
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than to items in different clusters, according to some defined criteria. Clustering algorithms are computationally intensive, particularly when they are used to analyze large amounts of data. A possible approach to reducing the processing time is to implement clustering algorithms on scalable parallel computers. This paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results on different processor numbers and data sets are presented and compared with theoretical performance. In particular, experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.
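The partition-and-exchange pattern described in the abstract can be illustrated with a small sketch. This is not the P-AutoClass code and does not reproduce the AutoClass Bayesian model. It assumes an EM-style iteration for a one-dimensional Gaussian mixture, in which each simulated processor computes partial sums on its own partition and a global sum combines them before the parameters are updated. All function names are hypothetical, and `all_reduce_sum` is only a stand-in for whatever message-passing operation a real multicomputer would use.

```python
import math


def local_stats(partition, weights, means, variances):
    """Work done by one processor: per-class partial sums over its own partition."""
    k = len(means)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]  # [sum r, sum r*x, sum r*x^2]
    for x in partition:
        # Weighted Gaussian density of x under each class.
        dens = [
            weights[j]
            * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
            / math.sqrt(2 * math.pi * variances[j])
            for j in range(k)
        ]
        total = sum(dens)
        for j in range(k):
            r = dens[j] / total  # membership probability of x in class j
            stats[j][0] += r
            stats[j][1] += r * x
            stats[j][2] += r * x * x
    return stats


def all_reduce_sum(per_processor_stats):
    """Assumed stand-in for the exchange step: element-wise sum over processors."""
    k = len(per_processor_stats[0])
    return [
        [sum(p[j][i] for p in per_processor_stats) for i in range(3)]
        for j in range(k)
    ]


def update_parameters(stats, min_variance=1e-6):
    """Every processor can recompute identical parameters from the global sums."""
    n = sum(s[0] for s in stats)
    weights = [s[0] / n for s in stats]
    means = [s[1] / s[0] for s in stats]
    variances = [max(s[2] / s[0] - m * m, min_variance) for s, m in zip(stats, means)]
    return weights, means, variances


def parallel_em(data, n_processors, weights, means, variances, n_iter=20):
    """Simulate the processors sequentially; each owns a fixed slice of the data."""
    partitions = [data[p::n_processors] for p in range(n_processors)]
    for _ in range(n_iter):
        partial = [local_stats(part, weights, means, variances) for part in partitions]
        weights, means, variances = update_parameters(all_reduce_sum(partial))
    return weights, means, variances
```

Because only the small per-class sums are exchanged, not the data items, running the sketch with one partition or with several should give the same parameters up to floating-point rounding. The sketch does not guard against a class that attracts no data, which a real implementation would need to handle.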
Pages: 629-641
Number of pages: 13