Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches

Cited by: 6
Authors
Delany, Sarah Jane [2]
Bridge, Derek [1]
Affiliations
[1] Univ Coll Cork, Cork, Ireland
[2] Dublin Inst Technol, Dublin, Ireland
Keywords
spam filtering; case-based reasoning; case-base editing; case-based maintenance; feature selection; distance measures; text compression
DOI
10.1007/s10462-007-9041-6
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
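The feature-free idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this sketch assumes the Normalized Compression Distance (NCD), one well-known compression-based measure, and may differ from the measure the authors used. It uses Python's `zlib` (DEFLATE, a Lempel-Ziv compressor of the same family as GZip) and a plain k-nearest-neighbour majority vote. The function names, the `"spam"`/`"ham"` labels, and the voting rule are all hypothetical choices for illustration, not the ECUE system's design.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Assumption: NCD stands in for the paper's compression-based measure.
    Smaller values mean the texts share more compressible structure.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Label `query` by majority vote over its k nearest cases.

    `case_base` holds (email_text, label) pairs with label "spam" or "ham".
    Majority voting is an illustrative choice, not taken from the source.
    """
    nearest = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    spam_votes = sum(1 for _, label in nearest if label == "spam")
    return "spam" if spam_votes > len(nearest) / 2 else "ham"
```

No features are extracted or selected here, which is where the "no set-up costs" advantage comes from. Every classification compresses the query against each case, however, which is consistent with the abstract's remark that the feature-free systems are much slower at classification time and that case-base editing helps by shrinking the case base.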
Pages: 75-87
Page count: 13