Integrating K-means clustering with a relational DBMS using SQL

被引:44
作者
Ordonez, C [1 ]
机构
[1] Teradata, NCR, San Diego, CA 92127 USA
关键词
clustering; K-means; SQL; relational DBMS;
D O I
10.1109/TKDE.2006.31
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are more critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets, but relatively low for large data sets, whereas export times become a bottleneck for C++.
引用
收藏
页码:188 / 201
页数:14
相关论文
共 28 条
[1]  
AGGARWAL CC, 2000, P ACM SIGMOD INT C M, P70, DOI DOI 10.1145/335191
[2]  
[Anonymous], P 1998 ACM SIGMOD IN
[3]  
[Anonymous], 2004, P 10 ACM SIGKDD INT, DOI 10.1145/1014052.1016921
[4]  
Bayardo R.J., 1999, P 5 ACM SIGKDD INT C
[5]  
Bradley P. S., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P9
[6]  
CLEAR J, 1999, ACM KDD C, P425
[7]  
Duda R. O., 2000, PATTERN CLASSIFICATI
[8]  
Farnstrom F., 2000, SIGKDD Explor. Newsl, V2, P51, DOI DOI 10.1145/360402.360419
[9]  
FAYYAD U, 1996, COMMUN ACM, V39, P11, DOI DOI 10.1145/240455.240458
[10]   The LBG-U method for vector quantization - An improvement over LEG inspired from neural networks [J].
Fritzke, B .
NEURAL PROCESSING LETTERS, 1997, 5 (01) :35-45