Term frequency combined hybrid feature selection method for spam filtering

Cited by: 0
Authors
Yuanning Liu
Youwei Wang
Lizhou Feng
Xiaodong Zhu
Affiliations
[1] Jilin University
Source
Pattern Analysis and Applications | 2016 / Vol. 19
Keywords
Feature selection; Spam filtering; Document frequency; Term frequency; Parameter optimization
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous existing methods, those based on document frequency ignore term frequency information and therefore often produce unsatisfactory results. This paper proposes a hybrid method (called HBM) that combines document frequency information with term frequency information. To preserve the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection (called ODFFS) is chosen first, and the terms that are genuinely discriminative are selected by ODFFS. For the remaining features, term frequency information is taken into account and the terms with the highest HBM values are selected. In addition, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and naïve Bayes (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and TREC2007. HBM is compared with six feature selection methods: information gain, Chi square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure, and comprehensively measure feature selection. Experimental results show that HBM significantly outperforms the other feature selection methods on all four corpora with both the SVM and NB classifiers.
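The two-stage idea described in the abstract (first pick terms by a document frequency criterion, then rank the remaining terms using term frequency) can be illustrated with a minimal sketch. The abstract does not give the ODFFS or HBM formulas, so the two scores below are stand-ins chosen only for illustration; the function name `two_stage_select` and the parameters `k_df` and `k_tf` are hypothetical as well.

```python
from collections import Counter


def two_stage_select(docs, labels, k_df, k_tf):
    """Illustrative two-stage selection; NOT the paper's ODFFS/HBM formulas.

    docs   -- list of token lists
    labels -- parallel list of "spam" / "ham" (both classes must be present)
    k_df   -- number of terms picked by the document-frequency score
    k_tf   -- number of additional terms picked by the term-frequency score
    """
    df = {"spam": Counter(), "ham": Counter()}  # documents containing a term
    tf = {"spam": Counter(), "ham": Counter()}  # total occurrences of a term
    n_docs = Counter(labels)
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)
        df[label].update(set(tokens))

    vocab = set(tf["spam"]) | set(tf["ham"])
    n_tokens = {c: sum(tf[c].values()) or 1 for c in tf}

    def df_score(term):
        # Stand-in score: gap between per-class document rates.
        return abs(df["spam"][term] / n_docs["spam"]
                   - df["ham"][term] / n_docs["ham"])

    def tf_score(term):
        # Stand-in score: gap between per-class relative term frequencies.
        return abs(tf["spam"][term] / n_tokens["spam"]
                   - tf["ham"][term] / n_tokens["ham"])

    # Stage 1: terms that separate the classes by document frequency.
    # Ties are broken alphabetically so the result is deterministic.
    stage1 = sorted(vocab, key=lambda t: (-df_score(t), t))[:k_df]

    # Stage 2: among the remaining terms, use term frequency information.
    remaining = vocab - set(stage1)
    stage2 = sorted(remaining, key=lambda t: (-tf_score(t), t))[:k_tf]
    return stage1 + stage2
```

In this sketch, a term that appears in the same share of documents as another term can still be preferred in the second stage if it occurs much more often within one class, which is the kind of information a purely document frequency-based selector would miss.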
Pages: 369-383
Page count: 14