Term frequency combined hybrid feature selection method for spam filtering

Cited by: 14
Authors
Liu, Yuanning [1]
Wang, Youwei [1]
Feng, Lizhou [1]
Zhu, Xiaodong [1]
Affiliations
[1] Jilin Univ, 2699 Qianjin St, Changchun 130012, Jilin, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature selection; Spam filtering; Document frequency; Term frequency; Parameter optimization; CLASSIFICATION; ALGORITHM;
DOI
10.1007/s10044-014-0408-4
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selection methods ignore term frequency information and thus tend to produce unsatisfactory results. In this paper, a hybrid method (called HBM) that combines document frequency information and term frequency information is proposed. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection method (called ODFFS) is chosen; terms that are indeed discriminative are selected by ODFFS. For the remaining features, term frequency information is considered and the terms with the highest HBM values are selected. Further, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and Trec2007. Six feature selection methods (information gain, chi-square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure, and comprehensively measure feature selection) are compared with HBM. Experimental results show that HBM is significantly superior to the other feature selection methods on all four corpora with both the SVM and NB classifiers.
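The two-stage idea in the abstract (a document frequency-based pass first, then a term frequency-based ranking of the remaining terms) can be illustrated with a minimal sketch. The abstract does not give the ODFFS or HBM formulas, so this is not the authors' method: chi-square stands in for the document frequency-based selector, and a class-normalized term frequency gap stands in for the second-stage score. The function names `chi_square` and `two_stage_select`, the parameters `k_df` and `k_tf`, and the toy messages are all hypothetical.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return 0.0 if den == 0 else n * (n11 * n00 - n10 * n01) ** 2 / den


def two_stage_select(docs, labels, k_df, k_tf):
    """docs: list of token lists; labels: 1 = spam, 0 = legitimate."""
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    df = {1: Counter(), 0: Counter()}  # documents containing the term, per class
    tf = {1: Counter(), 0: Counter()}  # total occurrences of the term, per class
    for tokens, y in zip(docs, labels):
        tf[y].update(tokens)
        df[y].update(set(tokens))
    vocab = sorted(set(df[1]) | set(df[0]))

    # Stage 1 (stand-in, NOT the paper's ODFFS): rank by chi-square on
    # document frequencies and keep the top k_df terms.
    def df_score(t):
        a, b = df[1][t], df[0][t]
        return chi_square(a, b, n_spam - a, n_ham - b)

    stage1 = sorted(vocab, key=lambda t: (-df_score(t), t))[:k_df]

    # Stage 2 (stand-in, NOT the paper's HBM score): rank the remaining
    # terms by the gap between their class-normalized term frequencies.
    len_spam = sum(tf[1].values()) or 1
    len_ham = sum(tf[0].values()) or 1

    def tf_score(t):
        return abs(tf[1][t] / len_spam - tf[0][t] / len_ham)

    chosen = set(stage1)
    rest = [t for t in vocab if t not in chosen]
    stage2 = sorted(rest, key=lambda t: (-tf_score(t), t))[:k_tf]
    return stage1, stage2


# Toy corpus: "agenda" and "notes" each occur in exactly one legitimate
# message, so document frequency cannot separate them; term frequency can.
docs = [
    ["free", "money", "free", "free"],
    ["free", "offer", "money"],
    ["meeting", "agenda", "money"],
    ["meeting", "notes", "notes", "notes"],
]
labels = [1, 1, 0, 0]
stage1, stage2 = two_stage_select(docs, labels, k_df=2, k_tf=1)
print(stage1, stage2)
```

In this toy corpus the first stage keeps the two terms that split the classes perfectly by document frequency, and the second stage prefers "notes" over "agenda" only because it occurs more often, which is the kind of information a purely document frequency-based selector discards.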
Pages: 369-383
Number of pages: 15