A two-stage feature selection method for text categorization

Cited by: 43
Authors
Meng, Jiana [1 ,2 ]
Lin, Hongfei [1 ]
Yu, Yuhai [1 ,3 ]
Affiliations
[1] Dalian Univ Technol, Dept Comp Sci & Engn, Dalian 116024, Peoples R China
[2] Dalian Nationalities Univ, Coll Sci, Dalian 116600, Peoples R China
[3] Dalian Nationalities Univ, Sch Comp Sci & Engn, Dalian 116600, Peoples R China
Keywords
Feature selection; Text categorization; Latent semantic indexing; Support vector machine;
DOI
10.1016/j.camwa.2011.07.045
Chinese Library Classification
O29 [Applied Mathematics]
Discipline code
070104
Abstract
Feature selection for text categorization is a well-studied problem whose goal is to improve categorization effectiveness, computational efficiency, or both. Traditional term-matching text categorization systems represent each document as a vector in the vector space model; however, this representation requires a high-dimensional space and ignores the semantic relationships between terms, which leads to poor categorization accuracy. The latent semantic indexing method can overcome this problem by replacing individual terms with statistically derived conceptual indices. To improve both the accuracy and the efficiency of categorization, in this paper we propose a two-stage feature selection method. First, we apply a novel feature selection method to reduce the dimensionality of the term space; then we construct a new semantic space between terms based on the latent semantic indexing method. In applications to spam database categorization, we find that our two-stage feature selection method performs better. (C) 2011 Elsevier Ltd. All rights reserved.
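The two-stage pipeline the abstract describes can be sketched roughly as follows. The paper's specific first-stage selection metric is not given in this record, so a standard chi-square term scorer stands in for it; the function names (`chi_square_scores`, `two_stage_reduce`) and the parameters `k` (terms kept in stage one) and `r` (latent dimensions in stage two) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def chi_square_scores(X, y):
    """Chi-square score of each term against binary class labels.
    X: (n_docs, n_terms) binary term-presence matrix; y: (n_docs,) 0/1 labels.
    This is a common stand-in metric, not necessarily the paper's own."""
    n = len(y)
    pos = y == 1
    A = X[pos].sum(axis=0)           # term present, class positive
    B = X[~pos].sum(axis=0)          # term present, class negative
    C = pos.sum() - A                # term absent, class positive
    D = (~pos).sum() - B             # term absent, class negative
    num = n * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    safe = np.where(den > 0, den, 1)  # avoid division by zero
    return np.where(den > 0, num / safe, 0.0)

def two_stage_reduce(X, y, k=1000, r=100):
    """Stage 1: keep the k highest-scoring terms.
    Stage 2: latent semantic indexing via truncated SVD of the reduced matrix.
    Returns documents projected into the r-dimensional latent space,
    the indices of the kept terms, and the top-r right singular vectors."""
    keep = np.argsort(chi_square_scores(X, y))[::-1][:k]
    Xk = X[:, keep].astype(float)
    U, s, Vt = np.linalg.svd(Xk, full_matrices=False)
    r = min(r, len(s))
    return U[:, :r] * s[:r], keep, Vt[:r]
```

The low-dimensional document representations returned by `two_stage_reduce` would then be fed to a classifier such as a support vector machine, as the keywords suggest.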
Pages: 2793-2800
Page count: 8