Supervised hierarchical clustering using CART

Times cited: 0
Authors
Hancock, TP [1]
Coomans, DH [1]
Everingham, YL [1]
Affiliations
[1] James Cook Univ N Queensland, Dept Math & Stat, Townsville, Qld 4811, Australia
Source
MODSIM 2003: INTERNATIONAL CONGRESS ON MODELLING AND SIMULATION, VOLS 1-4: VOL 1: NATURAL SYSTEMS, PT 1; VOL 2: NATURAL SYSTEMS, PT 2; VOL 3: SOCIO-ECONOMIC SYSTEMS; VOL 4: GENERAL SYSTEMS | 2003
Keywords
CART; supervised clustering; sea surface temperatures;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The size and complexity of current data mining data sets have eclipsed the limits of traditional statistical techniques. Such large datasets frequently require some form of cluster analysis, usually a hierarchical cluster analysis. However, implementing a traditional hierarchical scheme on large datasets requires an additional cluster validation analysis. Classification and Regression Trees (CART) is a non-parametric regression and classification technique that has become popular within the biotechnology and ecological fields. CART's intuitive interpretation and ability to handle large datasets make it easily accessible to the non-statistician, because it presents the statistical relationships found in the form of a binary tree. This paper proposes a supervised clustering algorithm capable of finding real clusters within large datasets by using CART to filter the clusters found by any hierarchical technique. The supervision performed by CART filters the results of a hierarchical cluster analysis by merging or removing poorly defined groups. It is common practice to validate a cluster analysis using discriminant analysis; however, this assumes that the correct number of clusters is known. CART implements a selective classification of groups, allowing some groups not to be explicitly classified, a feature not supported by standard discriminant analysis. This selective classification serves two purposes: first, it filters or merges clusters that are not validated by the data; second, it acts as a relationship model for the clusters found and provides statistical measures of certainty for the analysis. An example of this method is presented using Sea Surface Temperatures (SST). This is an ideal choice, as very little statistical cluster analysis has been performed on this dataset, yet knowledge of such structure is in high demand.
The analysis is performed for a single month, November, for the years 1940 through 2002, where some of the most useful variation is expected. The supervised clustering technique successfully extracted seven meaningful clusters, which were predicted with a cross-validated classification rate of 0.50.
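The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Ward linkage, the per-cluster recall threshold `min_recall`, the merge-into-most-confused-cluster rule, and the function name `supervised_clusters` are all assumptions made for illustration. scikit-learn's `DecisionTreeClassifier` stands in for CART, and SciPy provides the hierarchical clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def supervised_clusters(X, n_initial=6, min_recall=0.7, seed=0):
    """Cut a hierarchical clustering, then merge clusters a tree cannot recover.

    Hypothetical helper: the threshold and merge rule are illustrative assumptions.
    """
    # Step 1: unsupervised hierarchical clustering (Ward linkage is an assumption).
    labels = fcluster(linkage(X, method="ward"), t=n_initial, criterion="maxclust")

    while len(np.unique(labels)) > 2:
        # Step 2: cross-validated tree predictions of the current cluster labels.
        tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=seed)
        pred = cross_val_predict(tree, X, labels, cv=5)

        # Fraction of each cluster's members that the tree assigns back to it.
        ids = np.unique(labels)
        recall = np.array([np.mean(pred[labels == k] == k) for k in ids])
        if recall.min() >= min_recall:
            break  # every remaining cluster is validated by the tree

        # Step 3: merge the worst cluster into the one it is most confused with.
        worst = ids[np.argmin(recall)]
        confused = pred[(labels == worst) & (pred != worst)]
        vals, counts = np.unique(confused, return_counts=True)
        labels[labels == worst] = vals[np.argmax(counts)]

    return labels
```

Calling `supervised_clusters(X)` on an `(n_samples, n_features)` array returns one integer label per row. The sketch only merges poorly recovered clusters; removing groups, or leaving some groups unclassified as the abstract describes, would need an additional rule.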
Pages: 1880-1885
Page count: 6