Improving spam email classification accuracy using ensemble techniques: a stacking approach

Cited by: 7

Authors:

Adnan, Muhammad [1]

Imam, Muhammad Osama [2]

Javed, Muhammad Furqan [2]

Murtza, Iqbal [2]

Affiliations:

[1] UiT Arctic Univ Norway, Dept Technol & Safety, Tromso, Norway

[2] Air Univ Islamabad, Fac Comp & AI, Islamabad, Pakistan

Source:

INTERNATIONAL JOURNAL OF INFORMATION SECURITY | 2024 / Vol. 23 / Issue 01

Keywords:

Spam; Email; Classification; Machine learning; Ensemble; Stacking method

DOI:

10.1007/s10207-023-00756-1

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Spam emails pose a substantial cybersecurity danger, necessitating accurate classification to reduce unwanted messages and mitigate risks. This study focuses on enhancing spam email classification accuracy using stacking ensemble machine learning techniques. We trained and tested five classifiers: logistic regression, decision tree, K-nearest neighbors (KNN), Gaussian naive Bayes and AdaBoost. To address overfitting, two distinct datasets of spam emails were aggregated and balanced. Evaluating individual classifiers based on recall, precision and F1 score metrics revealed AdaBoost as the top performer. Considering evolving spam technology and new message types challenging traditional approaches, we propose a stacking method. By combining predictions from multiple base models, the stacking method aims to improve classification accuracy. The results demonstrate superior performance of the stacking method with the highest accuracy (98.8%), recall (98.8%) and F1 score (98.9%) among tested methods. Additional experiments validated our approach by varying dataset sizes and testing different classifier combinations. Our study presents an innovative combination of classifiers that significantly improves accuracy, contributing to the growing body of research on stacking techniques. Moreover, we compare classifier performances using a unique combination of two datasets, highlighting the potential of ensemble techniques, specifically stacking, in enhancing spam email classification accuracy. The implications extend beyond spam classification systems, offering insights applicable to other classification tasks. Continued research on emerging spam techniques is vital to ensure long-term effectiveness.
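The stacking setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn is available, uses a synthetic balanced dataset from `make_classification` as a stand-in for vectorized email features, and assumes logistic regression as the meta-learner, which the abstract does not specify. All hyperparameters are library defaults or arbitrary choices, so the printed scores say nothing about the paper's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic, balanced data standing in for email feature vectors.
# This is NOT the aggregated spam dataset used in the paper.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    weights=[0.5, 0.5], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# The five base classifiers named in the abstract, with default settings.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("gnb", GaussianNB()),
    ("ada", AdaBoostClassifier(random_state=0)),
]

# Assumption: logistic regression as the meta-learner. It is trained on
# out-of-fold predictions of the base models (5-fold cross-validation).
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)

# The metrics the abstract reports: accuracy, precision, recall and F1.
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base models have already seen, is what keeps the stacked model from simply rewarding the base model that overfits most.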

Pages: 505-517

Page count: 13
