An integration of WordNet and fuzzy association rule mining for multi-label document clustering

Cited by: 38
Authors
Chen, Chun-Ling [1]
Tseng, Frank S. C. [2]
Liang, Tyne [1]
Affiliations
[1] Natl Chiao Tung Univ, Dept Comp Sci, Hsinchu 300, Taiwan
[2] Natl Kaohsiung First Univ Sci & Technol, Dept Informat Management, Kaohsiung 824, Taiwan
Keywords
Fuzzy association rule mining; Text mining; Document clustering; WordNet; Frequent itemsets
DOI
10.1016/j.datak.2010.08.003
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid growth of text documents, document clustering has become one of the main techniques for organizing large numbers of documents into a small number of meaningful clusters. However, document clustering still faces several challenges, such as high dimensionality, scalability, accuracy, meaningful cluster labels, overlapping clusters, and extracting semantics from texts. To improve the quality of document clustering results, we propose an effective Fuzzy-based Multi-label Document Clustering (FMDC) approach that integrates fuzzy association rule mining with the existing ontology WordNet to alleviate these problems. In our approach, the key terms are extracted from the document set, and the initial representation of all documents is further enriched with WordNet hypernyms in order to exploit the semantic relations between terms. Then, a fuzzy association rule mining algorithm for texts is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, each document is dispatched into more than one target cluster by referring to these candidate clusters, and then the highly similar target clusters are merged. We conducted experiments to evaluate the performance on the Classic, Re0, R8, and WebKB datasets. The experimental results show that our approach achieves higher accuracy than influential document clustering methods. Therefore, our approach not only provides more general and meaningful labels for documents, but also effectively generates overlapping clusters. (C) 2010 Elsevier B.V. All rights reserved.
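The pipeline described in the abstract can be illustrated with a small Python sketch. This is not the authors' FMDC implementation: the hypernym map `HYPERNYMS` is a hand-written stand-in for WordNet, the linear `membership` function, the min-based `fuzzy_support`, and the thresholds `min_support` and `min_degree` are all assumptions chosen for illustration, and the final step of merging highly similar clusters is omitted.

```python
from itertools import combinations

# Hypothetical toy hypernym map standing in for WordNet (not real WordNet data).
HYPERNYMS = {"cat": "animal", "dog": "animal", "car": "vehicle", "bus": "vehicle"}

def enrich(tokens):
    """Count each term and also count its hypernym, if the toy map has one."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
        h = HYPERNYMS.get(t)
        if h is not None:
            counts[h] = counts.get(h, 0) + 1
    return counts

def membership(count, high=3):
    """Assumed fuzzy membership: rises linearly with the count, capped at 1."""
    return min(count / high, 1.0)

def fuzzy_support(itemset, fuzzy_docs):
    """Average, over documents, of the smallest membership among the terms."""
    total = sum(min(d.get(t, 0.0) for t in itemset) for d in fuzzy_docs)
    return total / len(fuzzy_docs)

def frequent_itemsets(fuzzy_docs, min_support, max_size=2):
    """Brute-force search for itemsets whose fuzzy support meets the threshold."""
    terms = sorted({t for d in fuzzy_docs for t in d})
    found = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(terms, size):
            s = fuzzy_support(itemset, fuzzy_docs)
            if s >= min_support:
                found[itemset] = s
    return found

def assign(fuzzy_docs, labels, min_degree):
    """Put a document in every cluster whose label terms it matches strongly enough."""
    clusters = {label: [] for label in labels}
    for i, d in enumerate(fuzzy_docs):
        for label in labels:
            if min(d.get(t, 0.0) for t in label) >= min_degree:
                clusters[label].append(i)
    return clusters

# Four tiny tokenized documents (made-up example data).
docs = [
    ["cat", "dog", "cat"],
    ["dog", "car"],
    ["car", "bus", "bus"],
    ["cat", "bus"],
]
fuzzy_docs = [{t: membership(c) for t, c in enrich(d).items()} for d in docs]
labels = frequent_itemsets(fuzzy_docs, min_support=0.4)
clusters = assign(fuzzy_docs, labels, min_degree=0.3)
print(clusters)
```

With this toy data, only the hypernyms `animal` and `vehicle` reach the support threshold, so they become the cluster labels, and documents 1 and 3 fall into both clusters. That mirrors, at a very small scale, the abstract's claims about more general labels and overlapping clusters.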
Pages: 1208-1226
Page count: 19