Binary PSO with mutation operator for feature selection using decision tree applied to spam detection

Cited by: 325
Authors
Zhang, Yudong [1]
Wang, Shuihua [1, 2]
Phillips, Preetha [3]
Ji, Genlin [1]
Affiliations
[1] Nanjing Normal Univ, Sch Comp Sci & Technol, Nanjing 210023, Jiangsu, Peoples R China
[2] Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210046, Jiangsu, Peoples R China
[3] Shepherd Univ, Sch Nat Sci & Math, Shepherdstown, WV 25443 USA
Funding
National Natural Science Foundation of China
Keywords
Spam detection; Binary Particle Swarm Optimization; Mutation operator; Feature selection; Wrapper; Premature convergence; Decision tree; Cost matrix; SUPPORT VECTOR MACHINE; CLASSIFICATION; PERFORMANCE; OPTIMIZATION; CLASSIFIERS; ALGORITHM; FRAMEWORK; SYSTEM; C4.5;
DOI
10.1016/j.knosys.2014.03.015
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we proposed a novel spam detection method that focused on reducing the false positive error of mislabeling nonspam as spam. First, we used a wrapper-based feature selection method to extract crucial features. Second, a decision tree was chosen as the classifier model, with C4.5 as the training algorithm. Third, a cost matrix was introduced to give different weights to the two error types, i.e., the false positive and the false negative errors. We defined a weight parameter alpha to adjust the relative importance of the two error types. Fourth, K-fold cross validation was employed to reduce out-of-sample error. Finally, binary PSO with a mutation operator (MBPSO) was used as the subset search strategy. Our experimental dataset contains 6000 emails, which were collected during 2012. We conducted a Kolmogorov-Smirnov hypothesis test on the capital-run-length related features and found that all the p values were less than 0.001. Afterwards, we found that alpha = 7 was the most appropriate value for our model. Among seven meta-heuristic algorithms, we demonstrated that MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared MBPSO with conventional feature selection methods such as SFS and SBS; the results showed that MBPSO performs better than both. We also demonstrated that wrappers are more effective than filters with regard to classification performance indexes. The results show that the proposed method is effective and can reduce the false positive error without compromising the sensitivity and accuracy values. (c) 2014 Elsevier B.V. All rights reserved.
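As a rough illustration of two ideas named in the abstract, the sketch below shows a cost-weighted error and a single binary PSO update followed by a bit-flip mutation. It is not the authors' implementation: the abstract gives no formulas, so the assumption that alpha multiplies the false positive cost, the sigmoid transfer function, and every name and parameter value here (`weighted_cost`, `mbpso_step`, `w`, `c1`, `c2`, `v_max`, `p_mut`) are hypothetical choices made for illustration only.

```python
import math
import random


def weighted_cost(y_true, y_pred, alpha=7.0):
    """Cost-weighted error rate. Labels: 1 = spam, 0 = nonspam.

    Assumption: a false positive (nonspam flagged as spam) costs alpha,
    and a false negative costs 1.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (alpha * fp + fn) / len(y_true)


def mbpso_step(position, velocity, pbest, gbest, rng,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, p_mut=0.05):
    """One update of a single particle; each bit marks a feature as kept (1) or dropped (0).

    Uses the common sigmoid-based binary PSO rule, then flips each bit
    with probability p_mut (the assumed mutation operator).
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        # Standard velocity update, clamped to [-v_max, v_max].
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        v = max(-v_max, min(v_max, v))
        # The sigmoid maps the velocity to the probability that the bit is 1.
        bit = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
        # Mutation: an occasional bit flip to help escape premature convergence.
        if rng.random() < p_mut:
            bit = 1 - bit
        new_position.append(bit)
        new_velocity.append(v)
    return new_position, new_velocity


if __name__ == "__main__":
    rng = random.Random(0)
    mask, vel = mbpso_step([1, 0, 1, 0], [0.0] * 4, [1, 1, 0, 0], [0, 1, 1, 0], rng)
    print(mask, vel)
    print(weighted_cost([0, 0, 1, 1], [1, 0, 0, 1], alpha=7.0))
```

In a wrapper setting, the bit mask returned by `mbpso_step` would pick the feature columns used to train the classifier, and a cross-validated `weighted_cost` could serve as the fitness to minimize.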
Pages: 22-31
Page count: 10
Related papers
48 in total
[1]   Frequent approximate subgraphs as features for graph-based image classification [J].
Acosta-Mendoza, Niusvel ;
Gago-Alonso, Andres ;
Medina-Pagola, Jose E. .
KNOWLEDGE-BASED SYSTEMS, 2012, 27 :381-392
[2]  
Bouabda R., 2011, 2011 4th International Conference on Logistics (LOGISTIQUA), P526, DOI 10.1109/LOGISTIQUA.2011.5939454
[3]   A discrete mixture-based kernel for SVMs: Application to spam and image categorization [J].
Bouguila, Nizar ;
Amayri, Ola .
INFORMATION PROCESSING & MANAGEMENT, 2009, 45 (06) :631-642
[4]   Using J-pruning to reduce overfitting in classification trees [J].
Bramer, M .
KNOWLEDGE-BASED SYSTEMS, 2002, 15 (5-6) :301-308
[5]   A GA-based feature selection approach with an application to handwritten character recognition [J].
De Stefano, C. ;
Fontanella, F. ;
Marrocco, C. ;
di Freca, A. Scotto .
PATTERN RECOGNITION LETTERS, 2014, 35 :130-141
[6]   Spam detection using Random Boost [J].
DeBarr, Dave ;
Wechsler, Harry .
PATTERN RECOGNITION LETTERS, 2012, 33 (10) :1237-1244
[7]   A case-based technique for tracking concept drift in spam filtering [J].
Delany, SJ ;
Cunningham, P ;
Tsymbal, A ;
Coyle, L .
KNOWLEDGE-BASED SYSTEMS, 2005, 18 (4-5) :187-195
[8]   SpamHunting: An instance-based reasoning system for spam labelling and filtering [J].
Fdez-Riverola, F. ;
Iglesias, E. L. ;
Diaz, F. ;
Mendez, J. R. ;
Corchado, J. M. .
DECISION SUPPORT SYSTEMS, 2007, 43 (03) :722-736
[9]   Identification of SPAM messages using an approach inspired on the immune system [J].
Guzella, T. S. ;
Mota-Santos, T. A. ;
Uchoa, J. Q. ;
Caminhas, W. M. .
BIOSYSTEMS, 2008, 92 (03) :215-225
[10]   CALA: An unsupervised URL-based web page classification system [J].
Hernandez, Inma ;
Rivero, Carlos R. ;
Ruiz, David ;
Corchuelo, Rafael .
KNOWLEDGE-BASED SYSTEMS, 2014, 57 :168-180