Effective data summarization for hierarchical clustering in large datasets

被引:0
作者
Bidyut Kr. Patra
Sukumar Nandi
机构
[1] National Institute of Technology Rourkela,
[2] VTT Technical Research Centre of Finland,undefined
[3] Indian Institute of Technology Guwahati,undefined
来源
Knowledge and Information Systems | 2015年 / 42卷
关键词
Data summarization; Single-link; Large datasets; Leaders method;
D O I
暂无
中图分类号
学科分类号
摘要
Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link is a distance-based hierarchical clustering method, which can find non-convex (arbitrary)-shaped clusters in a dataset. However, this method cannot be used for clustering large dataset as this method either keeps entire dataset in main memory or scans dataset multiple times from secondary memory of the machine. Both of them are potentially severe problems for cluster analysis in large datasets. One remedy for both problems is to create a summary of a given dataset efficiently, and the summary is subsequently used to speed up clustering methods in large datasets. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up single-link clustering method in large datasets. The ds utilizes sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds. Modified clustering method is termed as summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and clustering results produced by SSL method are close to the clustering results produced by single-link method. The SSL method outperforms single-link using data bubble (summarization scheme) both in terms of clustering accuracy and computation time. To speed up proposed summarization scheme, a technique is introduced to reduce a large number of distance computations in leaders method. Experimental studies demonstrate effectiveness of the proposed summarization scheme for large datasets.
引用
收藏
页码:1 / 20
页数:19
相关论文
共 25 条
[1]  
Ankerst M(1999)OPTICS: ordering points to identify the clustering structure SIGMOD Rec 28 49-60
[2]  
Breunig MM(2009)Rough-fuzzy weighted k-nearest leader classifier for large data sets Pattern Recognit 42 1719-1731
[3]  
Kriegel HP(2010)Data clustering: 50 years beyond k-means Pattern Recognit Lett 31 651-666
[4]  
Sander J(1999)Data clustering: a review ACM Comput Surv 31 264-323
[5]  
Babu VS(1967)Step-wise clustering procedures J Am Stat Assoc 62 86-101
[6]  
Viswanath P(1982)Least squares quantization in PCM IEEE Trans Inf Theory 28 129-137
[7]  
Jain AK(1984)Complexities of hierarchic clustering algorithms: state of the art Comput Stat Q 1 101-113
[8]  
Jain AK(2002)CLARANS: a method for clustering objects for spatial data mining IEEE Trans Knowl Data Eng 14 1003-1016
[9]  
Murty MN(2011)A distance based clustering method for arbitrary shaped clusters in large datasets Pattern Recognit 44 2862-2870
[10]  
Flynn PJ(1971)Objective criteria for evaluation of clustering methods J Am Stat Assoc 66 846-850