Feature selection strategies for text categorization

Cited by: 0
Authors
Soucy, P [1]
Mineau, GW
Affiliations
[1] Copern Inc, Copern Res, Quebec City, PQ, Canada
[2] Univ Laval, Dept Comp Sci, Quebec City, PQ, Canada
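Source
ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS | 2003 / Vol. 2671
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract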
Feature selection is an important research issue in text categorization because thousands of features are often involved, even when the simplest document representation model, the so-called bag-of-words, is used. Among the many approaches to feature selection, an important one ranks features with a scoring function and filters them accordingly. Many of these scoring functions are widely used in text categorization. In past feature selection studies, most researchers have focused on comparing these measures in terms of the accuracy achieved. For any measure, however, many selection strategies can be applied to produce the resulting feature set. In this paper, we compare some such strategies and propose a new one. Tests were conducted to compare five selection strategies on four datasets, using three distinct classifiers and four common feature scoring functions. As a result, it is possible to better understand which strategies are suited to particular classification settings.
Pages: 505-509
Number of pages: 5