Distributed clustering of categorical data using the information bottleneck framework

Cited by: 5
Authors
Tagasovska, Natasa [1]
Andritsos, Periklis [2]
Affiliations
[1] Univ Lausanne, HEC, Dept Informat Syst, CH-1015 Lausanne, Switzerland
[2] Univ Toronto, Fac Informat, 140 St George St, Toronto, ON M5S 3G6, Canada
Keywords
Distributed clustering; Categorical data; Information Bottleneck
DOI
10.1016/j.is.2017.10.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We perform clustering of categorical data using the Information Bottleneck (IB) framework at large scale. We examine the performance of existing solutions using multiple machine architectures. The IB method uses information theory to recast database relations as probability distributions and the proximity of their tuples as the loss of information incurred when they are considered together. More precisely, we study the Agglomerative Information Bottleneck, the Sequential Information Bottleneck and LIMBO, a newer approach that uses summaries of the original data. First we evaluate the performance and limitations of these algorithms when confronted with large datasets on a single, powerful machine. We then propose new implementations that take advantage of distributed environments. Using real and large synthetic datasets of tens of gigabytes in size, we finally evaluate their effectiveness and efficiency. (C) 2017 Elsevier Ltd. All rights reserved.
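The idea of measuring tuple proximity as information loss can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual IB-style formulation in which each tuple is a uniform distribution over its (attribute, value) pairs, and the cost of merging two clusters is their combined weight times the weighted Jensen-Shannon divergence of their distributions. The function names `tuple_distribution` and `merge_cost` and the toy tuples are hypothetical.

```python
import math


def tuple_distribution(row):
    """Represent a tuple as a uniform distribution over its (attribute, value) pairs.

    Assumption: every attribute of the tuple carries equal mass 1/m.
    """
    mass = 1.0 / len(row)
    return {(attr, value): mass for attr, value in row.items()}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must cover the support of p."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(p_i, w_i, p_j, w_j):
    """Information loss of merging two clusters, and the merged distribution.

    p_i, p_j: conditional distributions of the clusters over (attribute, value) pairs.
    w_i, w_j: cluster weights (fraction of all tuples each cluster holds).
    """
    w = w_i + w_j
    a, b = w_i / w, w_j / w
    keys = set(p_i) | set(p_j)
    merged = {k: a * p_i.get(k, 0.0) + b * p_j.get(k, 0.0) for k in keys}
    js = a * kl(p_i, merged) + b * kl(p_j, merged)  # weighted Jensen-Shannon divergence
    return w * js, merged


# Toy relation with three tuples, each weighted 1/3 (hypothetical data).
t1 = tuple_distribution({"color": "red", "shape": "square"})
t2 = tuple_distribution({"color": "red", "shape": "square"})
t3 = tuple_distribution({"color": "blue", "shape": "circle"})
t4 = tuple_distribution({"color": "red", "shape": "circle"})

cost_same, _ = merge_cost(t1, 1 / 3, t2, 1 / 3)      # identical tuples
cost_diff, _ = merge_cost(t1, 1 / 3, t3, 1 / 3)      # no shared values
cost_partial, _ = merge_cost(t1, 1 / 3, t4, 1 / 3)   # one shared value
```

Under these assumptions, merging identical tuples loses no information, merging tuples with no shared values is the most expensive, and partial overlap falls in between. An agglomerative scheme would repeatedly merge the pair with the smallest cost.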
Pages: 161-178
Page count: 18
Related papers
50 items in total
  • [1] Coercion: A Distributed Clustering Algorithm for Categorical Data
    Wang, Bin
    Zhou, Yang
    Hei, Xinhong
    2013 9TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS), 2013, : 683 - 687
  • [2] Hierarchical division clustering framework for categorical data
    Wei, Wei
    Liang, Jiye
    Guo, Xinyao
    Song, Peng
    Sun, Yijun
    NEUROCOMPUTING, 2019, 341 : 118 - 134
  • [3] A categorical data clustering framework on graph representation
    Bai, Liang
    Liang, Jiye
    PATTERN RECOGNITION, 2022, 128
  • [4] A bi-clustering framework for categorical data
    Pensa, RG
    Robardet, C
    Boulicaut, JF
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 643 - 650
  • [5] INCREMENTAL CLUSTERING USING INFORMATION BOTTLENECK THEORY
    Liu, Yongli
    Ouyang, Yuanxin
    Xiong, Zhang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2011, 25 (05) : 695 - 712
  • [6] Geometric clustering using the information bottleneck method
    Still, S
    Bialek, W
    Bottou, L
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 16, 2004, 16 : 1165 - 1172
  • [7] A Framework for Clustering Massive Text and Categorical Data Streams
    Aggarwal, Charu C.
    Yu, Philip S.
    PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 479 - 483
  • [8] A Framework for Clustering Categorical Time-Evolving Data
    Cao, Fuyuan
    Liang, Jiye
    Bai, Liang
    Zhao, Xingwang
    Dang, Chuangyin
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2010, 18 (05) : 872 - 882
  • [9] Data clustering by Markovian relaxation and the Information Bottleneck Method
    Tishby, N
    Slonim, N
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 13, 2001, 13 : 640 - 646
  • [10] Incremental Clustering for Categorical Data Using Clustering Ensemble
    Li Taoying
    Chen Yan
    Qu Lili
    Mu Xiangwei
    PROCEEDINGS OF THE 29TH CHINESE CONTROL CONFERENCE, 2010, : 2519 - 2524