Clustering Generalised Instances Set Approaches for Text Classification

Cited by: 1

Authors:

Najadat, Hassan [1]

Obeidat, Rasha [2]

Hmeidi, Ismail [1]

Affiliations:

[1] Jordan Univ Sci & Technol, Comp Informat Syst Dept, POB 3030, Irbid 22110, Jordan

[2] Jordan Univ Sci & Technol, Comp Sci Dept, Irbid, Jordan

Source:

JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT | 2011 / Vol. 10 / No. 01

Keywords:

Text classification; K-means clustering; generalised instances set

DOI:

10.1142/S0219649211002857

Chinese Library Classification (CLC) number:

G25 [Library science, librarianship]; G35 [Information science, information work]

Subject classification codes:

1205 ; 120501 ;

Abstract:

This paper introduces three new text classification methods: Clustering-Based Generalised Instances Set (CB-GIS), Multilevel Clustering-Based Generalised Instances Set (MLC-GIS) and Multilevel Clustering-Based k Nearest Neighbours (MLC-kNN). These new methods aim to unify the strengths and overcome the drawbacks of three similarity-based text classification methods, namely kNN, centroid-based and GIS. The new methods use a clustering technique called spherical K-means to represent each class by a representative set of generalised instances, which is later used for classification. The CB-GIS method applies flat clustering, while MLC-GIS and MLC-kNN apply multilevel clustering. Extensive experiments have been conducted to evaluate the new methods and compare them with the kNN, centroid-based and GIS classifiers on the Reuters-21578(10) benchmark dataset. The evaluation has been performed in terms of classification performance and classification efficiency. The experimental results show that the top-performing classification method is the MLC-kNN classifier, followed by the MLC-GIS and CB-GIS classifiers. According to the best micro-averaged F1 scores, the new methods (CB-GIS, MLC-GIS, MLC-kNN) have improvements of 4.48%, 4.65% and 4.76% over kNN, 1.84%, 1.92% and 2.12% over the centroid-based classifier, and 5.26%, 5.34% and 5.45% over GIS, respectively. With respect to the best macro-averaged F1 scores, the new methods (CB-GIS, MLC-GIS, MLC-kNN) have improvements of 10.29%, 10.19% and 10.45% over kNN, 0.1%, 0.03% and 0.29% over the centroid-based classifier, and 3.75%, 3.68% and 3.94% over GIS, respectively.
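
The abstract gives no algorithmic details, so the following is only a minimal sketch of the flat-clustering idea behind a CB-GIS-style classifier, not the authors' implementation. It assumes non-negative term-weight vectors, cosine similarity, and a nearest-representative decision rule. All function names, the value of k, and the toy data are hypothetical. The multilevel variants (MLC-GIS, MLC-kNN) are not sketched.

```python
import numpy as np


def l2_normalize(X):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0.0, 1.0, norms)


def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Cluster unit-length rows of X by cosine similarity; return unit centroids."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X))
    # Initialise centroids with k distinct documents (an assumed, simple choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each document to the centroid with the highest cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = l2_normalize(members.sum(axis=0, keepdims=True))[0]
    return centroids


def fit_class_representatives(X, y, k=2, seed=0):
    """Run flat spherical k-means inside each class; return (centroids, labels)."""
    X = l2_normalize(X)
    y = np.asarray(y)
    reps, rep_labels = [], []
    for label in np.unique(y):
        c = spherical_kmeans(X[y == label], k, seed=seed)
        reps.append(c)
        rep_labels.extend([label] * len(c))
    return np.vstack(reps), np.array(rep_labels)


def predict(X, reps, rep_labels):
    """Label each document with the class of its most similar representative."""
    sims = l2_normalize(X) @ reps.T
    return rep_labels[np.argmax(sims, axis=1)]


# Toy term-count vectors (hypothetical data, two classes of three documents).
X_train = np.array([
    [3, 1, 0, 0], [2, 2, 0, 0], [1, 3, 0, 0],
    [0, 0, 3, 1], [0, 0, 2, 2], [0, 0, 1, 3],
])
y_train = np.array(["grain"] * 3 + ["trade"] * 3)

reps, rep_labels = fit_class_representatives(X_train, y_train, k=2)
preds = predict(np.array([[4, 1, 0, 0], [0, 0, 1, 5]]), reps, rep_labels)
print(preds)
```

In this sketch each class ends up with k cluster centroids standing in for its training documents, so a test document is compared against a handful of representatives per class instead of a single class centroid or every training document.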

Pages: 91-107

Page count: 17
