Cluster structure evaluation of a dyadic k-means algorithm for mining large image archives

Cited by: 2
Authors
Daschiel, H [1]
Datcu, M [1]
Affiliation
[1] German Aerosp Ctr DLR, Remote Sensing Technol Inst, IMF, D-82234 Oberpfaffenhofen, Wessling, Germany
Source
IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING VIII | 2003 / Vol. 4885
Keywords
clustering; unsupervised classification; evaluation
DOI
10.1117/12.463151
Chinese Library Classification (CLC) number
TP7 [Remote sensing technology];
Discipline classification codes
081102 ; 0816 ; 081602 ; 083002 ; 1404 ;
Abstract
In many data mining and knowledge discovery applications, clustering methods are used for data reduction. When the amount of data grows, as in image information mining, where gigabytes of data must be processed, many existing clustering algorithms cannot be applied because of their high computational complexity. To overcome this limitation, we developed an efficient clustering algorithm called dyadic k-means, a modified and enhanced version of traditional k-means. Whereas k-means has a computational complexity of O(nk) for n samples and k clusters, dyadic k-means has a complexity of O(n log k). Our algorithm is therefore particularly efficient for grouping very large data sets into a large number of clusters. In this article we present statistically based methods for the objective evaluation of clusters obtained by dyadic k-means. The main focus is on how well the clusters describe the distribution of data points in a multidimensional feature space and how much information can be obtained from the clusters. We consider both how the samples fill the feature space and how well the clusters produced by dyadic k-means characterize this configuration. We use the well-established scatter matrices to measure the compactness and separability of the clustered groups in the feature space. The probability of error, another indicator of how well the clusters characterize the samples in the feature space, is also calculated for each point. This probability expresses the relationship of each point to its cluster and can therefore be regarded as a measure of cluster reliability. We test the evaluation methods on both a synthetic and a real-world data set.
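The abstract does not give the formulas used, so the following is only a minimal sketch of how scatter matrices are commonly used to score compactness and separability, assuming the standard within-cluster and between-cluster definitions and the trace criterion tr(S_w^-1 S_b). The function names `scatter_matrices` and `separability` are hypothetical and not taken from the paper.

```python
import numpy as np


def scatter_matrices(X, labels):
    """Return the within-cluster (S_w) and between-cluster (S_b) scatter
    matrices, both normalized by the total number of samples.

    Assumption: standard textbook definitions, not necessarily the exact
    variant used by the authors.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    overall_mean = X.mean(axis=0)

    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        # Spread of the samples around their own cluster mean (compactness).
        centered = Xc - mean_c
        S_w += centered.T @ centered
        # Spread of the cluster mean around the overall mean (separability),
        # weighted by the cluster size.
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)
    return S_w / n, S_b / n


def separability(X, labels):
    """Trace criterion tr(S_w^-1 S_b): larger values indicate compact,
    well-separated clusters. The pseudo-inverse guards against a
    singular S_w."""
    S_w, S_b = scatter_matrices(X, labels)
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))


if __name__ == "__main__":
    # Two small, well-separated groups of 2-D points (illustrative data).
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
    good = np.array([0, 0, 0, 1, 1, 1])
    mixed = np.array([0, 1, 0, 1, 0, 1])
    print("matching labels:", separability(X, good))
    print("mixed labels:   ", separability(X, mixed))
```

With these definitions S_w + S_b equals the total (mixture) scatter matrix, so a labeling that lowers the within-cluster scatter necessarily raises the between-cluster scatter. This makes the criterion convenient for comparing two clusterings of the same data.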
Pages: 120-130
Page count: 11