A combined feature selection approach for malicious email detection based on a comprehensive email dataset

Cited by: 0

Authors:

Zhang, Han [1]

Shi, Yong [1]

Liu, Ming [1]

Chen, Libo [1]

Wu, Songyang [2]

Xue, Zhi [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Dongchuan Rd, Shanghai 200240, Peoples R China

[2] Third Res Inst Minist Publ Secur, Yueyang Rd, Shanghai 200031, Peoples R China

Source:

CYBERSECURITY | 2025 / Vol. 8 / Issue 01

Keywords:

Malicious email detection; Dataset; Machine learning; Random forest; SPAM;

DOI:

10.1186/s42400-024-00309-6

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

In recent years, new types of malicious email attacks have emerged. We identify two major challenges in machine-learning-based malicious email detection. (1) Existing work uses different datasets, and the field lacks a unified, comprehensive open-source dataset that can serve as a standard for evaluating detection performance. In addition, outdated data makes it difficult to detect new types of malicious email attacks. (2) Feature selection and extraction are limited. Relying only on static features or only on body text features is not sufficient to detect both common phishing or spam emails and new malicious emails that exploit protocol vulnerabilities. To address these problems, we propose the Exploiting Protocol Vulnerability Malicious Email (EPVME) dataset, which contains 49,136 malicious email samples. The EPVME dataset is constructed by summarizing and simulating novel types of malicious email attacks that exploit email protocol vulnerabilities. It significantly increases both the coverage of malicious email types and the number of samples. By also collecting the currently available open-source datasets, we build a large-scale dataset with 660,985 samples. Through two sets of comparative experiments on the dataset containing EPVME, we verify the necessity, reliability, and validity of the EPVME dataset. By providing a large and comprehensive open-source email dataset, we hope to help subsequent work on malicious email detection evaluate performance on a comparable basis. Furthermore, we propose a new feature selection and construction method that combines static features and textual features. We extract 79 static features from the header and body parts of email samples, perform textual feature extraction on the pre-processed body parts, and combine various machine learning algorithms for detection model construction and experimental comparison. Our detection model achieves an accuracy of 99.968% and a false positive rate of 0.099%.
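
The idea of combining static features with textual features can be illustrated with a minimal sketch. It is not the authors' implementation: the seven static features below, the helper names (`body_text`, `static_features`, `text_features`, `combined_vector`), the sample message, and the use of raw token counts over a fixed vocabulary are all assumptions chosen for illustration, and the record does not list the 79 features used in the paper. Only the Python standard library is used.

```python
import re
from collections import Counter
from email import message_from_string

# Hypothetical patterns for this sketch; the paper's pre-processing is not described here.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z]{2,}")


def body_text(msg):
    """Concatenate the decoded text parts of a parsed message."""
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            if payload is not None:
                charset = part.get_content_charset() or "utf-8"
                parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def static_features(msg, body):
    """A few illustrative static features from the header and body (assumed, not the paper's list)."""
    return {
        "num_headers": len(msg.keys()),
        "num_received": len(msg.get_all("Received", [])),
        # More than one From header is unusual in legitimate mail.
        "num_from_headers": len(msg.get_all("From", [])),
        "has_reply_to": int(msg.get("Reply-To") is not None),
        "is_multipart": int(msg.is_multipart()),
        "num_urls": len(URL_RE.findall(body)),
        "body_length": len(body),
    }


def text_features(body, vocabulary):
    """Token counts of the body over a fixed vocabulary (a simple bag of words)."""
    counts = Counter(TOKEN_RE.findall(body.lower()))
    return [counts[word] for word in vocabulary]


def combined_vector(raw_email, vocabulary):
    """Static features followed by textual features, as one flat vector."""
    msg = message_from_string(raw_email)
    body = body_text(msg)
    return list(static_features(msg, body).values()) + text_features(body, vocabulary)


# Made-up sample message with two From headers and one URL in the body.
SAMPLE = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "From: alice@example.com\n"
    "From: attacker@evil.example\n"
    "Subject: Verify your account\n"
    "\n"
    "Please verify your account at http://evil.example/login now.\n"
)

if __name__ == "__main__":
    print(combined_vector(SAMPLE, ["verify", "account", "invoice"]))
```

A vector built this way could be passed to any classifier, such as the random forest named in the keywords. Keeping the static and textual parts in one flat vector lets a single model weigh header anomalies and body wording together.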


Pages: 22
