A simple density with distance based initial seed selection technique for K means algorithm

被引:4
作者
Azimuddin S.S. [1 ]
Desikan K. [2 ]
机构
[1] School of computing sicence and engineerg VIT, Chennai
[2] Department of mathematics, School of Advanced Sciences VIT, Chennai
关键词
Density; Distance; Initial seed concept; K means; Single pass;
D O I
10.20532/cit.2017.1003605
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
Open issues with respect to K means algorithm are - identifying the number of clusters, initial seed concept selection, clustering tendency, handling empty clusters, identifying outliers etc. In this paper we propose a novel and simple technique considering both density and distance of the concepts in a dataset to identify initial seed concepts for clustering. Many authors have proposed different techniques to identify initial seed concepts; but our method ensures that the initial seed concepts are chosen from different clusters which are to be generated by the clustering solution. The hallmark of our algorithm is that it is a single pass algorithm which does not require any extra parameters to be estimated. Further, our seed concepts are among the actual concepts and not the mean of representative concepts as is the case in many other algorithms. We have implemented our proposed algorithm and compared the results with the interval based technique of Khan. We see that our method outperforms the interval based method. We have also compared our method with the original random K means and K Means ++ algorithms.
引用
收藏
页码:291 / 300
页数:9
相关论文
共 21 条
[1]  
Jain A.K., Et al., Data Clustering: A Review, ACM Computing Surveys, 31, 3, (1999)
[2]  
Lloyd S.P., Least Squares Quantization in PCM, IEEE Transactions on Information Theory, 28, 2, pp. 129-136, (1982)
[3]  
Sarle W.S., Cluster Analysis by Least Squares, In Proceedings of the Seventh Annual SAS Users Group International Conference, pp. 651-653
[4]  
Arthur D., Vassilvitskii S., How Slow is the K-Means Method?, In Proceedings of the twenty-second annual symposium on computational geometry, (2006)
[5]  
Astrahan M.M., Speech Analysis by Clustering or the Hyperphoneme Method, Tech. Rep. AIM-124, (1970)
[6]  
Cao F., Et al., An Initialization Method for the K-Means Algorithm Using Neighborhood Model, Computers and Mathematics with Applications, 58, 3, pp. 474-483, (2009)
[7]  
Arthur D., Vassilvitskii S., K-means++: The Advantages of Careful Seeding, In SODA, pp. 1027-1035, (2007)
[8]  
Ostrovsky R., Et al., The Effectiveness of Lloyd-Type Methods for the K-Means Problem, In Symposium on Foundations of Computer Science, (2006)
[9]  
Effros M., Schulman L.J., Deterministic Clustering with Data Nets, In Proc. ISIT, (2004)
[10]  
Khan F., An Initial Seed Selection Algorithm for K-Means Clustering of Geo-Referenced Data to Improve Replicability of Cluster Assignments for Mapping Application, Applied Soft Computing, 12, 11, pp. 3698-3700, (2012)