Rough sets for spam filtering: Selecting appropriate decision rules for boundary e-mail classification

Cited by: 27
Authors
Perez-Diaz, Noemi [1]
Ruano-Ordas, David [1]
Mendez, Jose R. [1]
Galvez, Juan F. [1]
Fdez-Riverola, Florentino [1]
Affiliations
[1] Univ Vigo, ESEI Escuela Super Ingn Informat, Orense 32004, Spain
Keywords
Spam classification; Rough sets; Rule execution schemes; Content-based techniques; Model evaluation; PERFORMANCE;
DOI
10.1016/j.asoc.2012.05.024
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Nowadays, spam represents an extensive subset of the information delivered through the Internet, covering all unsolicited and disruptive communications received through services such as e-mail, weblogs and forums. In this context, this paper reviews and brings together previous approaches and novel alternatives for applying rough set (RS) theory to the spam filtering domain by defining three rule execution schemes: MFD (most frequent decision), LNO (largest number of objects) and LTS (largest total strength). To assess the suitability of the proposed algorithms correctly, we specifically address and analyse questions that matter for sound model validation, such as corpus selection, preprocessing and representational issues, as well as specific benchmarking measures. From experiments using several execution schemes for selecting appropriate decision rules generated by rough sets, we conclude that the proposed algorithms can outperform other well-known anti-spam filtering techniques such as support vector machines (SVM), AdaBoost and different types of Bayes classifiers. (c) 2012 Elsevier B.V. All rights reserved.
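The abstract names the three rule execution schemes but does not define them. As a rough illustration only, the sketch below shows one plausible reading of how a final decision might be chosen among the rules that match a message. The `Rule` fields (`decision`, `support`, `strength`) and the exact selection logic are assumptions inferred from the scheme names, not the paper's definitions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical decision rule that already matched a message."""
    decision: str    # class label, e.g. "spam" or "ham"
    support: int     # assumed: number of training objects the rule covers
    strength: float  # assumed: weight of the rule


def mfd(rules):
    """Most frequent decision: majority vote over the matching rules."""
    return Counter(r.decision for r in rules).most_common(1)[0][0]


def lno(rules):
    """Largest number of objects: decision of the rule with most support."""
    return max(rules, key=lambda r: r.support).decision


def lts(rules):
    """Largest total strength: decision whose rules sum to most strength."""
    totals = defaultdict(float)
    for r in rules:
        totals[r.decision] += r.strength
    return max(totals, key=totals.get)


# Illustrative rules only; the numbers are made up for the example.
matched = [
    Rule("ham", support=5, strength=0.1),
    Rule("ham", support=4, strength=0.1),
    Rule("spam", support=30, strength=0.5),
]
```

With these made-up rules, the majority vote favours "ham", while the support-based and strength-based schemes favour "spam". This is the kind of disagreement on boundary messages that makes the choice of scheme matter.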
Pages: 3671-3682
Page count: 12