A review of spam email detection: analysis of spammer strategies and the dataset shift problem

Cited by: 34
Authors
Janez-Martino, Francisco [1, 2]
Alaiz-Rodriguez, Rocio [1, 2]
Gonzalez-Castro, Victor [1, 2]
Fidalgo, Eduardo [1, 2]
Alegre, Enrique [1, 2]
Affiliations
[1] Univ Leon, Dept Elect Syst & Automat, Leon, Spain
[2] INCIBE Spanish Natl Cybersecur Inst, Leon, Spain
Keywords
Spam email detection; Dataset shift; Adversarial machine learning; Spammer strategies; Feature selection; Concept drift; Classification; Patterns
DOI
10.1007/s10462-022-10195-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spam emails have traditionally been seen as merely annoying, unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. To ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Most machine-learning-based spam filters recently published in academic journals report very high performance, yet users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges arise in this field: (a) it is a very dynamic environment prone to the dataset shift problem, and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one focuses on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating emails, and we review the state-of-the-art techniques for developing filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching up to 48.81%.
Pages: 1145-1173
Page count: 29