Mining fuzzy frequent itemsets for hierarchical document clustering

Cited by: 34

Authors:

Chen, Chun-Ling [1]

Tseng, Frank S. C. [2]

Liang, Tyne [1]

Affiliations:

[1] Natl Chiao Tung Univ, Dept Comp Sci, Hsinchu 300, Taiwan

[2] Natl Kaohsiung First Univ Sci & Technol, Dept Informat Management, Yenchao 824, Kaohsiung, Taiwan

Source:

INFORMATION PROCESSING & MANAGEMENT | 2010 / Vol. 46 / No. 2

Keywords:

Fuzzy association rule mining; Text mining; Hierarchical document clustering; Frequent itemsets

DOI:

10.1016/j.ipm.2009.09.009

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents in a wide range of applications. However, most document clustering methods still struggle with high dimensionality, scalability, accuracy, and the generation of meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F²IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, key terms are first extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, whose key terms are regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only retains the merits of FIHC but also improves its clustering accuracy. Crown Copyright (C) 2009 Published by Elsevier Ltd. All rights reserved.
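
The mining step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the region names (low, mid, high), the breakpoints 1, 3, and 5, the use of the minimum operator to combine memberships, and all function names are assumptions chosen for illustration. Each document is a mapping from key term to term frequency, each frequency is mapped to fuzzy membership degrees, and an itemset's fuzzy support is its average membership over the document set.

```python
from itertools import combinations

REGIONS = ("low", "mid", "high")  # assumed fuzzy regions for term frequency


def fuzzify(tf):
    """Map a term frequency to membership degrees (assumed breakpoints 1, 3, 5)."""
    if tf <= 0:  # an absent term belongs to no region
        return {r: 0.0 for r in REGIONS}
    return {
        "low": max(0.0, min(1.0, (3 - tf) / 2)),   # 1 at tf <= 1, 0 at tf >= 3
        "mid": max(0.0, 1 - abs(tf - 3) / 2),      # peaks at tf == 3
        "high": max(0.0, min(1.0, (tf - 3) / 2)),  # 0 at tf <= 3, 1 at tf >= 5
    }


def fuzzy_support(docs, itemset):
    """Average over documents of the minimum membership of the itemset's items."""
    total = 0.0
    for doc in docs:
        total += min(fuzzify(doc.get(term, 0))[region] for term, region in itemset)
    return total / len(docs)


def mine_fuzzy_itemsets(docs, min_sup):
    """Return fuzzy frequent 1- and 2-itemsets with support >= min_sup."""
    items = sorted({(t, r) for d in docs for t in d for r in REGIONS})
    frequent = {}
    for item in items:
        sup = fuzzy_support(docs, (item,))
        if sup >= min_sup:
            frequent[(item,)] = sup
    singles = [key[0] for key in frequent]
    for a, b in combinations(singles, 2):
        if a[0] != b[0]:  # never pair two regions of the same term
            sup = fuzzy_support(docs, (a, b))
            if sup >= min_sup:
                frequent[(a, b)] = sup
    return frequent


# Toy document set: key term -> term frequency (made-up values)
docs = [
    {"fuzzy": 5, "cluster": 3},
    {"fuzzy": 5, "cluster": 3},
    {"mining": 1},
]
result = mine_fuzzy_itemsets(docs, min_sup=0.5)
for itemset, sup in result.items():
    print(itemset, round(sup, 3))
```

In this sketch, the terms of each surviving itemset would serve as a candidate cluster label, in the sense the abstract describes; building the hierarchical cluster tree from those candidates is not shown.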


Pages: 193-211

Page count: 19
