Advanced Analysis of Learning-Based Spam Email Filtering Methods Based on Feature Distribution Differences of Dataset

Cited by: 0

Authors:

Kim, Jin-Seong [1]

Lee, Han-Jin [1]

Lee, Han-Ju [1]

Choi, Seok-Hwan [1]

Affiliations:

[1] Yonsei Univ, Div Software, Wonju Si 26493, Gangwon Do, South Korea

Source:

IEEE ACCESS | 2024 / Vol. 12

Funding:

National Research Foundation, Singapore;

Keywords:

Unsolicited e-mail; Filtering; Threat modeling; Deep learning; Data models; Cleaning; Tokenization; Measurement; Biological system modeling; Accuracy; Long short term memory; Spam email filtering; recurrent neural network (RNN); gated recurrent unit (GRU); long short-term memory (LSTM); ALBERT; security; DEEP;

DOI:

10.1109/ACCESS.2024.3495830

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Spam emails, which are unsolicited bulk emails, pose a significant threat to digital communication security. To counter spam emails, learning-based spam email filtering methods have been extensively studied. However, as spam patterns evolve, these methods struggle to maintain the accuracy of models trained on outdated patterns. To demonstrate these limitations empirically and gain insight into the classification patterns of spam email filtering models, we propose an advanced method for analyzing the performance degradation of spam email filtering models. The proposed analysis method involves text preprocessing, embedding model training, spam email filtering model training, evaluation, and analysis of the classification patterns of the learning-based spam email filtering models. From experimental results across various datasets and spam email filtering models, we show that the accuracy of spam email filtering models decreases significantly when the feature distribution of the test dataset differs from that of the training dataset. We also provide insights for improving the model architecture, dataset structure, and training strategies by analyzing factors such as the confusion matrix, performance metrics, mean sequence length, out-of-vocabulary (OOV) rate, and top-20 tokens.
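As a rough illustration of three of the dataset-level factors named in the abstract (mean sequence length, OOV rate, and top-20 tokens), the following minimal Python sketch compares a training corpus with a test corpus. It is not the authors' implementation: the lowercase whitespace tokenizer, the function names, the token-level definition of OOV rate, and the toy emails are all assumptions made for this example, since the record does not give the paper's exact definitions.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's tokenizer is not specified here.
    return text.lower().split()


def mean_sequence_length(docs):
    # Average number of tokens per document (0.0 for an empty corpus).
    lengths = [len(tokenize(d)) for d in docs]
    return sum(lengths) / len(lengths) if lengths else 0.0


def oov_rate(train_docs, test_docs):
    # Assumed definition: fraction of test tokens that never occur in the training corpus.
    vocab = {tok for d in train_docs for tok in tokenize(d)}
    test_tokens = [tok for d in test_docs for tok in tokenize(d)]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def top_tokens(docs, k=20):
    # The k most frequent tokens in the corpus, with their counts.
    counts = Counter(tok for d in docs for tok in tokenize(d))
    return counts.most_common(k)


# Hypothetical toy corpora standing in for a training set and a shifted test set.
train = ["win a free prize now", "meeting at noon tomorrow"]
test = ["claim your free crypto reward", "lunch meeting tomorrow"]

print(mean_sequence_length(train), mean_sequence_length(test))
print(oov_rate(train, test))
print(top_tokens(test, k=20))
```

A high OOV rate, or a top-token list that differs sharply between the two corpora, would be one simple signal that the test distribution has drifted away from the training distribution.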


Pages: 167313-167323

Page count: 11

