Advanced Analysis of Learning-Based Spam Email Filtering Methods Based on Feature Distribution Differences of Dataset

Cited by: 0
Authors
Kim, Jin-Seong [1 ]
Lee, Han-Jin [1 ]
Lee, Han-Ju [1 ]
Choi, Seok-Hwan [1 ]
Affiliation
[1] Yonsei Univ, Div Software, Wonju Si 26493, Gangwon Do, South Korea
Source
IEEE ACCESS | 2024 / Vol. 12
Funding
National Research Foundation, Singapore;
Keywords
Unsolicited e-mail; Filtering; Threat modeling; Deep learning; Data models; Cleaning; Tokenization; Measurement; Biological system modeling; Accuracy; Long short term memory; Spam email filtering; recurrent neural network (RNN); gated recurrent unit (GRU); long short-term memory (LSTM); ALBERT; security; DEEP;
DOI
10.1109/ACCESS.2024.3495830
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
Spam emails, which are unsolicited bulk emails, pose a significant threat to digital communication security. To counter spam emails, learning-based spam email filtering methods have been extensively studied. However, as spam patterns evolve, models trained on outdated patterns struggle to maintain their accuracy. To demonstrate these limitations empirically and gain insight into the classification patterns of spam email filtering models, we propose an advanced analysis method for examining the performance degradation of these models. The proposed analysis method involves text preprocessing, embedding model training, spam email filtering model training, evaluation, and analysis of the classification patterns of the learning-based spam email filtering models. Experimental results across various datasets and spam email filtering models show that the accuracy of the models decreases significantly when the feature distribution of the test dataset differs from that of the training dataset. We also provide insights for improving the model architecture, dataset structure, and training strategies by analyzing factors such as the confusion matrix, performance metrics, mean sequence length, out-of-vocabulary (OOV) rate, and top-20 tokens.
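Three of the dataset factors named in the abstract (mean sequence length, OOV rate, and top-20 tokens) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokenizer, the function names, and the toy emails are all assumptions made for illustration, and the paper's actual preprocessing and tokenization are not described in this record.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # The paper's real preprocessing is not specified in this record.
    return text.lower().split()


def mean_sequence_length(docs):
    # Average number of tokens per document (0.0 for an empty dataset).
    if not docs:
        return 0.0
    return sum(len(tokenize(d)) for d in docs) / len(docs)


def oov_rate(train_docs, test_docs):
    # Fraction of test tokens that never occur in the training vocabulary.
    vocab = {t for d in train_docs for t in tokenize(d)}
    test_tokens = [t for d in test_docs for t in tokenize(d)]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for t in test_tokens if t not in vocab)
    return unseen / len(test_tokens)


def top_k_tokens(docs, k=20):
    # The k most frequent tokens in a dataset, with their counts.
    counts = Counter(t for d in docs for t in tokenize(d))
    return counts.most_common(k)


# Toy data (invented for illustration only).
train = ["win a free prize now", "meeting moved to monday"]
test = ["claim your free crypto bonus", "lunch on monday"]

print(mean_sequence_length(test))
print(oov_rate(train, test))
print(top_k_tokens(train, k=3))
```

A high OOV rate or a large gap in mean sequence length between the training and test sets is one simple signal that their feature distributions differ, which is the condition under which the abstract reports accuracy degradation.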
Pages: 167313-167323
Number of pages: 11