Interaction between Feature Subset Selection Techniques and Machine Learning Classifiers for Detecting Unsolicited Emails

Cited by: 13
Authors
Trivedi, Shrawan Kumar [1]
Dey, Shubhamoy [1]
Affiliations
[1] Indian Inst Management, Informat Syst, Indore 453556, Madhya Pradesh, India
Source
APPLIED COMPUTING REVIEW | 2014 / Vol. 14 / Issue 01
Keywords
Algorithms; Performance; Experimentation; Spam Filtering; Email spam classification; Feature selection; Probabilistic classifiers; Bayesian; Naive Bayes; Support Vector Machine; J48; Random Forest; Genetic; False Positive Rate; Classification Accuracy; F-Value;
DOI
10.1145/2600617.2600622
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Detecting spam emails within a set of email files has become a challenging task for researchers. An effective classifier is identified not only by high detection accuracy but also by a low false alarm rate and the need to use as few features as possible. In view of these challenges, this research examines the effects of using features selected by four feature subset selection methods (i.e., Genetic, Greedy Stepwise, Best First, and Rank Search) on popular machine learning classifiers such as Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48, and Random Forest. Tests were performed on three publicly available spam email datasets: "Enron", "SpamAssassin", and "LingSpam". The results show that Greedy Stepwise Search is a good method for feature subset selection for spam email detection. Among the machine learning classifiers, Support Vector Machine was found to be the best in terms of both accuracy and false positive rate. However, the results of Random Forest were very close to those of Support Vector Machine. The Genetic classifier was identified as a weak classifier.
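The abstract compares classifiers by classification accuracy, false positive rate, and F-value. As a rough illustration only, the sketch below computes these three measures from confusion-matrix counts, assuming the standard definitions with spam as the positive class and the F-value taken as the balanced F1 score. The abstract does not state the exact formulas the authors used, and the function name spam_metrics and the counts in the usage lines are hypothetical.

```python
def spam_metrics(tp, fp, tn, fn):
    """Return (accuracy, false_positive_rate, f_value) from confusion counts.

    Assumed convention (not taken from the paper): spam is the positive class.
      tp: spam correctly flagged as spam
      fp: legitimate mail wrongly flagged as spam (false alarm)
      tn: legitimate mail correctly passed
      fn: spam wrongly passed as legitimate
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # Share of legitimate mail that was wrongly flagged as spam.
    negatives = fp + tn
    false_positive_rate = fp / negatives if negatives else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Balanced F-value (F1): harmonic mean of precision and recall.
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0

    return accuracy, false_positive_rate, f_value


# Hypothetical counts, for illustration only (not results from the paper).
accuracy, fpr, f_value = spam_metrics(tp=40, fp=10, tn=40, fn=10)
print(f"accuracy={accuracy:.2f}  FPR={fpr:.2f}  F-value={f_value:.2f}")
```

Reporting the false positive rate alongside accuracy reflects the concern raised in the abstract: a filter can reach high accuracy while still misclassifying legitimate mail as spam.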
Pages: 53-61
Page count: 9