Best terms: an efficient feature-selection algorithm for text categorization

Cited by: 35
Authors
Fragoudis, D [1]
Meretakis, D
Likothanassis, S
Affiliations
[1] Univ Patras, Comp Engn & Informat Dept, GR-26500 Patras, Greece
[2] Griffith Univ, Novartis Pharma, Basel, Switzerland
[3] Inst Comp Technol, Patras, Greece
Keywords
feature selection; machine learning; text categorization
DOI
10.1007/s10115-004-0177-2
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, comparing BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT leads to a considerable increase in the classification accuracy of NB and SVM as measured by the F1 measure; (3) BT leads to a considerable improvement in the speed of NB and SVM; in most cases, the training time of SVM drops by an order of magnitude; (4) in most cases, the combination of BT with the simple but very fast NB algorithm yields classification accuracy comparable with that of SVM, and it is sometimes even more accurate.
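The abstract reports accuracy using the F1 measure. As a reminder of what that quantity is, the following is a minimal sketch of the standard per-category F1 computation (the harmonic mean of precision and recall). The function name `f1_score` and the counts are hypothetical illustrations, not values or code from the paper, and the abstract does not state whether the scores are micro- or macro-averaged across categories.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 for one category from true-positive, false-positive
    and false-negative counts (harmonic mean of precision and recall)."""
    if tp == 0:
        # No correct positive predictions: precision or recall is zero
        # (or undefined), so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single category (assumed, not from the paper).
score = f1_score(tp=80, fp=20, fn=10)
print(f"F1 = {score:.4f}")
```

Because F1 ignores true negatives, it is commonly preferred over plain accuracy on collections such as Reuters-21578, where most categories contain only a small fraction of the documents.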
Pages: 16-33
Number of pages: 18