Scalable model-based clustering by working on data summaries

Authors
Jin, HD [1]
Wong, ML [1]
Leung, KS [1]
Affiliations
[1] Lingnan Univ, Dept Informat Syst, Hong Kong, Hong Kong, Peoples R China
Source
THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS | 2003
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems that use similar computational resources.
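The two-phase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' EMADS implementation: the fixed-width grid summarization, the farthest-first initialization, the regularization term, and all names (`summarize_on_grid`, `em_on_summaries`, `cell_width`, `reg`) are assumptions made for illustration. Each sub-cluster is reduced to its cardinality, mean, and covariance, and a simplified EM for a Gaussian mixture then runs on those summaries only, using the expected log-density of a sub-cluster under each component in the E-step.

```python
import numpy as np

def summarize_on_grid(X, cell_width):
    """Phase 1 (assumed procedure): bucket points into grid cells and keep
    only the count, mean, and covariance of each non-empty cell."""
    keys = np.floor(X / cell_width).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts, means, covs = [], [], []
    for cell in range(inverse.max() + 1):
        pts = X[inverse == cell]
        mu = pts.mean(axis=0)
        diff = pts - mu
        counts.append(len(pts))
        means.append(mu)
        covs.append(diff.T @ diff / len(pts))
    return np.array(counts, dtype=float), np.array(means), np.array(covs)

def em_on_summaries(counts, means, covs, n_components, n_iter=50, reg=1e-6):
    """Phase 2 (simplified sketch): EM for a Gaussian mixture that sees only
    sub-cluster summaries, never the raw points."""
    m, d = means.shape
    total = counts.sum()
    # Farthest-first initialization over sub-cluster means (an assumption).
    chosen = [int(np.argmax(counts))]
    for _ in range(1, n_components):
        gaps = means[:, None, :] - means[chosen][None, :, :]
        chosen.append(int(np.argmax((gaps ** 2).sum(axis=2).min(axis=1))))
    mu = means[chosen].copy()
    # Start every component from the covariance of the whole data set.
    centered = means - counts @ means / total
    global_cov = (np.einsum("i,ij,ik->jk", counts, centered, centered)
                  + np.einsum("i,ijk->jk", counts, covs)) / total
    sigma = np.tile(global_cov + reg * np.eye(d), (n_components, 1, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: expected log-density of each sub-cluster under each component.
        log_r = np.empty((m, n_components))
        for k in range(n_components):
            prec = np.linalg.inv(sigma[k])
            _, logdet = np.linalg.slogdet(sigma[k])
            diff = means - mu[k]
            maha = np.einsum("ij,jk,ik->i", diff, prec, diff)
            spread = np.einsum("jk,ikj->i", prec, covs)  # trace(prec @ covs[i])
            log_r[:, k] = np.log(weights[k]) - 0.5 * (
                d * np.log(2 * np.pi) + logdet + maha + spread)
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each sub-cluster contributes in proportion to its cardinality.
        mass = counts[:, None] * resp
        nk = np.maximum(mass.sum(axis=0), 1e-12)
        weights = nk / total
        mu = (mass.T @ means) / nk[:, None]
        for k in range(n_components):
            diff = means - mu[k]
            between = np.einsum("i,ij,ik->jk", mass[:, k], diff, diff)
            within = np.einsum("i,ijk->jk", mass[:, k], covs)
            sigma[k] = (between + within) / nk[k] + reg * np.eye(d)
    return weights, mu, sigma
```

In this sketch the cost of each EM iteration depends on the number of non-empty grid cells rather than on the number of data points, which is the source of the speed-up the abstract describes. A call such as `em_on_summaries(*summarize_on_grid(X, 1.0), n_components=2)` returns the mixture weights, means, and covariances.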
Pages: 91-98
Number of pages: 8