Clustering Generalised Instances Set Approaches for Text Classification

Cited by: 1
Authors
Najadat, Hassan [1]
Obeidat, Rasha [2]
Hmeidi, Ismail [1]
Affiliations
[1] Jordan Univ Sci & Technol, Comp Informat Syst Dept, POB 3030, Irbid 22110, Jordan
[2] Jordan Univ Sci & Technol, Comp Sci Dept, Irbid, Jordan
Keywords
Text classification; K-means clustering; generalised; instances set;
DOI
10.1142/S0219649211002857
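Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification code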
1205 ; 120501 ;
Abstract
This paper introduces three new text classification methods: Clustering-Based Generalised Instances Set (CB-GIS), Multilevel Clustering-Based Generalised Instances Set (MLC-GIS) and Multilevel Clustering-Based k Nearest Neighbours (MLC-kNN). These methods aim to unify the strengths and overcome the drawbacks of three similarity-based text classification methods, namely kNN, centroid-based and GIS. They use a clustering technique called spherical K-means to represent each class by a set of generalised instances that is later used for classification. CB-GIS applies flat clustering, while MLC-GIS and MLC-kNN apply multilevel clustering. Extensive experiments were conducted to evaluate the new methods and compare them with the kNN, centroid-based and GIS classifiers on the Reuters-21578(10) benchmark dataset, in terms of both classification performance and classification efficiency. The experimental results show that the top-performing classification method is MLC-kNN, followed by MLC-GIS and CB-GIS. According to the best micro-averaged F1 scores, the new methods (CB-GIS, MLC-GIS, MLC-kNN) improve by 4.48%, 4.65% and 4.76% over kNN, by 1.84%, 1.92% and 2.12% over the centroid-based classifier, and by 5.26%, 5.34% and 5.45% over GIS, respectively. With respect to the best macro-averaged F1 scores, the new methods (CB-GIS, MLC-GIS, MLC-kNN) improve by 10.29%, 10.19% and 10.45% over kNN, by 0.1%, 0.03% and 0.29% over the centroid-based classifier, and by 3.75%, 3.68% and 3.94% over GIS, respectively.
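The flat-clustering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the number of clusters per class, the term weighting, or the decision rule, so the nearest-generalised-instance rule, the toy vectors, and all function names below (`spherical_kmeans`, `train`, `classify`) are assumptions made only for illustration.

```python
import math
import random


def normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Cluster unit-length vectors by cosine similarity; return unit centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for doc in docs:
            best = max(range(k), key=lambda i: cosine(doc, centroids[i]))
            clusters[best].append(doc)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                summed = [sum(col) for col in zip(*members)]
                centroids[i] = normalise(summed)
    return centroids


def train(labelled_docs, k_per_class):
    """Represent each class by the centroids of its own clusters."""
    by_class = {}
    for vec, label in labelled_docs:
        by_class.setdefault(label, []).append(normalise(vec))
    return {
        label: spherical_kmeans(docs, min(k_per_class, len(docs)))
        for label, docs in by_class.items()
    }


def classify(model, vec):
    """Assumed rule: pick the class owning the most similar centroid."""
    query = normalise(vec)
    return max(
        model,
        key=lambda label: max(cosine(query, c) for c in model[label]),
    )


# Toy term-frequency vectors over a hypothetical 4-term vocabulary.
training = [
    ([3, 1, 0, 0], "grain"), ([2, 2, 0, 0], "grain"), ([0, 3, 1, 0], "grain"),
    ([0, 0, 3, 1], "trade"), ([0, 0, 1, 3], "trade"), ([1, 0, 2, 2], "trade"),
]
model = train(training, k_per_class=2)
prediction = classify(model, [4, 1, 0, 0])
```

In this sketch each class keeps a few cluster centroids instead of a single centroid (as in a centroid-based classifier) or every training document (as in kNN), which is one way to read the trade-off the abstract describes. The multilevel variants (MLC-GIS, MLC-kNN) are not covered here.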
Pages: 91-107
Number of pages: 17