A new feature selection method for text classification

被引:8
作者
Uchyigit, Gulden [1 ]
Clark, Keith [1 ]
机构
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London SW7 2AZ, England
关键词
feature selection; text classification; statistical inference;
D O I
10.1142/S0218001407005466
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Text classification is the problem of classifying a set of documents into a pre-defined set of classes. A major problem with text classification problems is the high dimensionality of the feature space. Only a small subset of these words are feature words which can be used in determining a document's class, while the rest adds noise and can make the results unreliable and significantly increase computational time. A common approach in dealing with this problem is feature selection where the number of words in the feature space are significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study including the new feature selection method, called the GU metric. The other feature selection methods evaluated in this study are: Chi-Squared (x(2)) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F-1 and F-2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
引用
收藏
页码:423 / 438
页数:16
相关论文
共 28 条
[1]  
ALMUALLIM H, 1991, PROCEEDINGS : NINTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P547
[2]  
[Anonymous], 1979, INFORM RETRIEVAL
[3]  
[Anonymous], 1997, Proceedings of the fourteenth international conference on machine learning, DOI DOI 10.1016/J.ESWA.2008.05.026
[4]  
Caropreso MF, 2001, TEXT DATABASES AND DOCUMENT MANAGEMENT: THEORY AND PRACTICE, P78
[5]   Comparison of discrimination methods for the classification of tumors using gene expression data [J].
Dudoit, S ;
Fridlyand, J ;
Speed, TP .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (457) :77-87
[6]  
Dumais S., 1998, Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, P148, DOI 10.1145/288627.288651
[7]  
DUMAIS ST, HIERARCHICAL CLASSIF
[8]  
FANO R, 1996, TRANSMISSION INFORM
[9]  
FORMA G, 2003, J MACH LEARN RES, V3
[10]  
FUKA K, 2001, IJCAI WORKSH TEXT LE