An effective web document clustering algorithm based on bisection and merge

Authors
Ingyu Lee
Byung-Won On
Affiliations
[1] Troy University, Sorrell College of Business
[2] Advanced Digital Sciences Center
Source
ARTIFICIAL INTELLIGENCE REVIEW, 2011, 36 (01)
Keywords
Clustering; Spectral bisection; Entity resolution; Data mining
Abstract
To cluster web documents that all share the same entity names, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out not to be effective at clustering such documents. Our investigation found that clustering these web pages is more complicated because (1) the number of ground-truth clusters is larger than the two or three typical of general clustering problems and (2) the clusters in the data set have extremely skewed size distributions. To overcome these problems, in this paper we propose an effective clustering algorithm that boosts the accuracy of K-means and spectral clustering. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to cluster web documents correctly. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
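The abstract names normalized cuts of the similarity graph G as the basis for the bisection and merge steps but does not give the procedure. As a rough illustration only, the sketch below shows one standard way to score a bipartition by its normalized cut and to bisect a graph using the second eigenvector of the normalized Laplacian. It assumes a dense, symmetric similarity matrix `W` in which every node has positive degree; the function names `ncut` and `spectral_bisect` are hypothetical, and this is not the authors' implementation. The merge criterion is not described in the abstract and is left out.

```python
import numpy as np


def ncut(W, mask):
    """Normalized cut of the bipartition (mask, ~mask) of similarity matrix W."""
    cut = W[mask][:, ~mask].sum()   # total weight crossing the partition
    vol_a = W[mask].sum()           # sum of degrees on side A
    vol_b = W[~mask].sum()          # sum of degrees on side B
    if vol_a == 0 or vol_b == 0:
        return float("inf")         # degenerate split: one side is empty or isolated
    return cut / vol_a + cut / vol_b


def spectral_bisect(W):
    """Split the graph in two using the sign of the generalized Fiedler vector."""
    d = W.sum(axis=1)                       # node degrees (assumed > 0)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)         # eigenvalues returned in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]       # map back to the generalized eigenproblem
    return fiedler > 0                      # boolean mask: True for one side
```

A recursive clustering routine could call `spectral_bisect` on each current cluster and use `ncut` to decide whether a split is worth keeping; how the paper makes that decision, and how it merges clusters afterwards, would have to be taken from the full text.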
Pages: 69 - 85
Number of pages: 16
Related papers
  • [1] An effective web document clustering algorithm based on bisection and merge
    Lee, Ingyu
    On, Byung-Won
    ARTIFICIAL INTELLIGENCE REVIEW, 2011, 36 (01) : 69 - 85
  • [2] A fuzzy-based algorithm for Web document clustering
    Friedman, M
    Kandel, A
    Schneider, M
    Last, M
    Shapira, B
    Elovici, Y
    Zaafrany, O
    NAFIPS 2004: ANNUAL MEETING OF THE NORTH AMERICAN FUZZY INFORMATION PROCESSING SOCIETY, VOLS 1 AND 2: FUZZY SETS IN THE HEART OF THE CANADIAN ROCKIES, 2004, : 524 - 527
  • [3] Clustering algorithm based on swarm intelligence for Web document
    Wu, Bin
    Fu, Wei-Peng
    Zheng, Yi
    Liu, Shao-Hui
    Shi, Zhong-Zhi
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2002, 39 (11)
  • [4] A web document clustering algorithm based on concept of neighbor
    Song, JC
    Shen, JY
    2003 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-5, PROCEEDINGS, 2003, : 46 - 50
  • [5] An improved clustering algorithm for web document
    Wang, Jing
    Liu, Zhijing
    Journal of Information and Computational Science, 2009, 6 (02) : 959 - 966
  • [6] A co-clustering algorithm based on structured Web document
    Deng, Dong-Mei
    Long, Ji-Zhen
    Yin, Xiang-Zhou
    Zhongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Central South University (Science and Technology), 2010, 41 (05) : 1871 - 1876
  • [7] AN EFFECTIVE FUZZY CLUSTERING ALGORITHM FOR WEB DOCUMENT CLASSIFICATION: A CASE STUDY IN CULTURAL CONTENT MINING
    Tsekouras, George E.
    Gavalas, Damianos
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2013, 23 (06) : 869 - 886
  • [8] K-means algorithm based on particle swarm optimization for web document clustering
    Xiao, L. Z.
    Shao, Z. Q.
    Gu, X. M.
    DYNAMICS OF CONTINUOUS DISCRETE AND IMPULSIVE SYSTEMS-SERIES B-APPLICATIONS & ALGORITHMS, 2006, 13E : 980 - 984
  • [9] A Document Clustering Algorithm for Web Search Engine Retrieval System
    Yang, Hongwei
    2010 INTERNATIONAL CONFERENCE ON E-EDUCATION, E-BUSINESS, E-MANAGEMENT AND E-LEARNING: IC4E 2010, PROCEEDINGS, 2010, : 383 - 386
  • [10] A projection-based split-and-merge clustering algorithm
    Cheng, Mingchang
    Ma, Tiefeng
    Liu, Youbo
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 116 : 121 - 130