Hierarchically Distributed Peer-to-Peer Document Clustering and Cluster Summarization

被引：27

作者：

Hammouda, Khaled M. ^{[1
]}

Kamel, Mohamed S. ^{[2
]}

机构：

[1] Desire2Learn Inc, Kitchener, ON N2G 1B9, Canada

[2] Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada

来源：

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2009年 / 21卷 / 05期

关键词：

Distributed data mining; distributed document clustering; hierarchical peer-to-peer networks; CATEGORIZATION;

D O I：

10.1109/TKDE.2008.189

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

In distributed data mining, adopting a flat node distribution model can affect scalability. To address the problem of modularity, flexibility, and scalability, we propose a Hierarchically distributed Peer-to-Peer (HP2PC) architecture and clustering algorithm. The architecture is based on a multilayer overlay network of peer neighborhoods. Supernodes, which act as representatives of neighborhoods, are recursively grouped to form higher level neighborhoods. Within a certain level of the hierarchy, peers cooperate within their respective neighborhoods to perform P2P clustering. Using this model, we can partition the clustering problem in a modular way across neighborhoods, solve each part individually using a distributed K-means variant, then successively combine clusterings up the hierarchy where increasingly more global solutions are computed. In addition, for document clustering applications, we summarize the distributed document clusters using a distributed keyphrase extraction algorithm, thus providing interpretation of the clusters. Results show decent speedup, reaching 165 times faster than centralized clustering for a 250-node simulated network, with comparable clustering quality to the centralized approach. We also provide comparison to the P2P K-means algorithm and show that HP2PC accuracy is better for typical hierarchy heights. Results for distributed cluster summarization match those of their centralized counterparts with up to 88 percent accuracy.

引用

页码：681 / 698

页数：18

共 32 条

[1] Clustering distributed data streams in peer-to-peer environments [J].

Bandyopadhyay, Sanghamitra ;

Giannella, Chris ;

Maulik, Ujjwal ;

Kargupta, Hillol ;

Liu, Kun ;

Datta, Souptik .

INFORMATION SCIENCES, 2006, 176 (14) :1952-1985

[2] Principal direction divisive partitioning [J].

Boley, D .

DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (04) :325-344

[3] Partitioning-based clustering for Web document categorization [J].

Boley, D ;

Gini, M ;

Gross, R ;

Han, EH ;

Hastings, K ;

Karypis, G ;

Kumar, V ;

Mobasher, B ;

Moore, J .

DECISION SUPPORT SYSTEMS, 1999, 27 (03) :329-341

[4] Document categorization and query generation on the World Wide Web using WebACE [J].

Boley, D ;

Gini, M ;

Gross, R ;

Han, EH ;

Hastings, K ;

Karypis, G ;

Kumar, V ;

Mobasher, B ;

Moore, J .

ARTIFICIAL INTELLIGENCE REVIEW, 1999, 13 (5-6) :365-391

[5] Distributed data mining and agents [J].

da Silva, JC ;

Giannella, C ;

Bhargava, R ;

Kargupta, H ;

Klusch, M .

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2005, 18 (07) :791-807

[6]

DATTA S, 2005, P 8 INT WORKSH HIGH

[7]

Datta S, 2006, SIAM PROC S, P153

[8] Distributed data mining in peer-to-peer networks [J].

Datta, Souptik ;

Bhaduri, Kanishka ;

Giannella, Chris ;

Kargupta, Hillol ;

Wolff, Ran .

IEEE INTERNET COMPUTING, 2006, 10 (04) :18-26

[9] A data-clustering algorithm on distributed memory multiprocessors [J].

Dhillon, IS ;

Modha, DS .

LARGE-SCALE PARALLEL DATA MINING, 2000, 1759 :245-260

[10]

Dunn J.C., 1974, J CYBERNETICS, V3, P95, DOI [DOI 10.1080/01969727408546059, 10.1080/019697274085460590304.68093]

← 1 2 3 4 →