HARP: A practical projected clustering algorithm

Cited by: 95
Authors
Yip, KY
Cheung, DW
Ng, MK
Affiliations
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
[2] Univ Hong Kong, Dept Math, Hong Kong, Hong Kong, Peoples R China
Keywords
data mining; mining methods and algorithms; clustering; bioinformatics
DOI
10.1109/TKDE.2004.74
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications.
Pages: 1387-1397
Page count: 11