A new semantic-based feature selection method for spam filtering

Cited by: 56
Authors
Mendez, Jose R. [1, 2]
Cotos-Yanez, Tomas R. [2, 3]
Ruano-Ordas, David [1, 2]
Affiliations
[1] Univ Vigo, Dept Comp Sci, ESEI, Campus Lagoas, Orense 32004, Spain
[2] Ctr Singular Invest Galicia, Ctr Invest Biomed, Campus Univ Lagoas Marcosende, Vigo 36310, Spain
[3] ESEI, Dept Stat & Operat Res, Campus Lagoas, Orense 32004, Spain
Keywords
Feature selection methods; Text mining; Spam filtering; e-mail; Classification; Machine learning; SUPPORT VECTOR MACHINES; CONCEPT DRIFT; CLASSIFICATION; ALGORITHMS; REDUCTION
DOI
10.1016/j.asoc.2018.12.008
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Internet has emerged as a powerful infrastructure for worldwide communication and interaction among people. Some unethical uses of this technology (for instance, spam or viruses) have created challenges in developing mechanisms that guarantee an affordable and secure user experience. This study deals with the massive delivery of unwanted content or advertising campaigns without the consent of the targeted users (also known as spam). Currently, words (tokens) are selected using feature selection schemes and then used to create feature vectors for training different Machine Learning (ML) approaches. This study introduces a new feature selection method that takes advantage of a semantic ontology to group words into topics and uses these topics to build feature vectors. To this end, we have compared the performance of nine well-known ML approaches in conjunction with (i) Information Gain, the most popular feature selection method in the spam-filtering domain; (ii) Latent Dirichlet Allocation, a generative statistical model that allows sets of observations to be explained by unobserved groups that describe why some parts of the data are similar; and (iii) our semantic-based feature selection proposal. The results show the suitability and additional benefits of topic-driven methods for developing and deploying high-performance spam filters. © 2018 Elsevier B.V. All rights reserved.
Pages: 89-104
Number of pages: 16