Spam Detection Using Feature Selection and Parameters Optimization

Cited by: 37
Authors
Lee, Sang Min [1]
Kim, Dong Seong [2]
Kim, Ji Ho [1]
Park, Jong Sou [1]
Affiliations
[1] Korea Aerosp Univ, Dept Comp Engn, Seoul, South Korea
[2] Duke Univ, Dept Elect & Comp Engn, Durham, NC USA
Source
PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPLEX, INTELLIGENT AND SOFTWARE INTENSIVE SYSTEMS (CISIS 2010) | 2010
Keywords
Feature Selection; Intrusion Detection; Parameters Optimization; Random Forests; Spam Detection; Spambase;
DOI
10.1109/CISIS.2010.116
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Spam is no longer merely a nuisance but a risk: it now often carries virus attachments and spyware agents that can damage recipients' systems, so there is a growing need for spam detection. Many spam detection techniques based on machine learning algorithms have been proposed. Because bulk mailing tools have increased the volume of spam tremendously, detection techniques must be able to cope with it. Parameter optimization and feature selection have been proposed to reduce processing overhead while maintaining high detection rates. However, previous approaches have not taken into account variable importance and the optimal number of features, and no approach so far uses both of them together. In this paper, we propose an optimal spam detection model based on Random Forests (RF) that supports both parameter optimization and feature selection. We optimize two parameters of RF to maximize the detection rate. We provide the variable importance of each feature so that irrelevant features are easy to eliminate. Furthermore, we determine the optimal number of selected features using two methods: (i) optimizing the parameters only once for the whole feature selection process, and (ii) optimizing the parameters in every feature elimination phase. We carry out experiments on the Spambase dataset and show the feasibility of our approach.
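The two feature-selection methods named in the abstract differ only in when the parameters are tuned, which a short sketch may make clearer. The Python below is an illustration, not the authors' implementation. Everything in it is an assumption: the function names (grid_search, select_features, toy_evaluate), the parameter names mtry and ntree (the abstract does not say which two RF parameters are optimized), and the toy scoring function, which stands in for training a Random Forest and reading its detection rate and variable importance.

```python
from itertools import product


def grid_search(evaluate, features, param_grid):
    """Try every parameter combination; return (params, score, importances) of the best."""
    best = None
    names = sorted(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score, importances = evaluate(features, params)
        if best is None or score > best[1]:
            best = (params, score, importances)
    return best


def select_features(evaluate, features, param_grid, reoptimize):
    """Backward elimination driven by variable importance.

    reoptimize=False: tune the parameters once, on the full feature set (method i).
    reoptimize=True:  re-tune the parameters after every elimination (method ii).
    Returns (score, features, params) for the best subset seen.
    """
    features = list(features)
    params, score, importances = grid_search(evaluate, features, param_grid)
    best = (score, list(features), params)
    while len(features) > 1:
        # Eliminate the feature with the lowest importance.
        weakest = min(features, key=lambda f: importances[f])
        features.remove(weakest)
        if reoptimize:
            params, score, importances = grid_search(evaluate, features, param_grid)
        else:
            score, importances = evaluate(features, params)
        # On ties, prefer the smaller feature set.
        if score >= best[0]:
            best = (score, list(features), params)
    return best


def toy_evaluate(features, params):
    """Hypothetical stand-in for 'train a forest, return (detection rate, importances)'."""
    useful = {"f1": 0.3, "f2": 0.2, "n1": 0.0, "n2": 0.0}  # made-up importances
    score = 0.4 + sum(useful[f] for f in features)
    score -= 0.01 * sum(1 for f in features if useful[f] == 0.0)  # noise features hurt
    score -= 0.001 * abs(params["mtry"] - 2)  # pretend mtry=2 is the best setting
    return score, {f: useful[f] for f in features}


grid = {"mtry": [1, 2, 3], "ntree": [100]}
all_features = ["f1", "f2", "n1", "n2"]
once = select_features(toy_evaluate, all_features, grid, reoptimize=False)
every_step = select_features(toy_evaluate, all_features, grid, reoptimize=True)
print(once[1], every_step[1])
```

In practice evaluate would wrap a real Random Forest trained on Spambase. Method (ii) costs one full grid search per eliminated feature, whereas method (i) runs a single grid search up front.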
Pages: 883-888
Number of pages: 6