A combined feature selection approach for malicious email detection based on a comprehensive email dataset

Cited by: 0
Authors
Zhang, Han [1]
Shi, Yong [1]
Liu, Ming [1]
Chen, Libo [1]
Wu, Songyang [2]
Xue, Zhi [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Dongchuan Rd, Shanghai 200240, Peoples R China
[2] Third Res Inst Minist Publ Secur, Yueyang Rd, Shanghai 200031, Peoples R China
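Source
Cybersecurity | 2025, Vol. 8, Issue 1
Keywords
Malicious email detection; Dataset; Machine learning; Random forest; Spam
DOI
10.1186/s42400-024-00309-6
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract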
In recent years, new malicious email attacks have emerged. We identify two major challenges in current machine-learning-based malicious email detection. (1) Existing work uses different datasets, and there is no unified, comprehensive open-source benchmark dataset for evaluating detection performance; moreover, outdated data makes it difficult to detect new types of malicious email attacks. (2) Feature selection and extraction are limited: relying only on static features or only on body text features cannot cover both common phishing or spam emails and new malicious emails that exploit protocol vulnerabilities. To address these problems, we propose the Exploiting Protocol Vulnerability Malicious Email (EPVME) dataset, which contains 49,136 malicious email samples. The EPVME dataset is constructed by summarizing and simulating novel malicious email attacks that exploit email protocol vulnerabilities, which significantly increases both the range of malicious email types covered and the number of samples. By also collecting the currently available open-source datasets, we build a large-scale dataset of 660,985 samples. Two sets of comparative experiments on the dataset containing EPVME verify the necessity, reliability, and validity of the EPVME dataset. By providing a large and comprehensive open-source email dataset, we hope to help subsequent work on malicious email detection report comparable performance results. Furthermore, we propose a new feature selection and construction method that combines static and textual features. We extract 79 static features from the header and body of each email sample, perform textual feature extraction on the pre-processed body, and build and compare detection models using various machine learning algorithms. Our detection model achieves an accuracy of 99.968% and a false positive rate of 0.099%.
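To make the idea of combining static and textual features concrete, the following is a minimal Python sketch using only the standard library. It is not the authors' implementation: the four static features (count of From headers, From/Reply-To domain mismatch, URL count, body length), the bag-of-words text representation, the vocabulary, and all function names are assumptions chosen for illustration, and the abstract does not list the paper's 79 static features. The sketch also shows how accuracy and false positive rate, the two reported metrics, are computed from predictions.

```python
import re
from email import message_from_string
from email.utils import parseaddr

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z]+")


def domain_of(header_value):
    """Return the lower-cased domain of an address header, or '' if absent."""
    _, addr = parseaddr(header_value or "")
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def static_features(msg, body):
    """Illustrative static features (hypothetical, not the paper's 79)."""
    from_headers = msg.get_all("From", [])
    from_domain = domain_of(from_headers[0]) if from_headers else ""
    reply_domain = domain_of(msg.get("Reply-To"))
    return [
        len(from_headers),  # more than one From header is unusual
        int(bool(reply_domain) and reply_domain != from_domain),
        len(URL_RE.findall(body)),
        len(body),
    ]


def text_features(body, vocabulary):
    """Bag-of-words counts over a fixed vocabulary (assumed representation)."""
    tokens = TOKEN_RE.findall(body.lower())
    return [tokens.count(term) for term in vocabulary]


def combined_features(raw_email, vocabulary):
    """Concatenate static and textual features into one vector."""
    msg = message_from_string(raw_email)
    # Simplification: only single-part messages contribute body text.
    body = "" if msg.is_multipart() else msg.get_payload()
    return static_features(msg, body) + text_features(body, vocabulary)


def accuracy_and_fpr(y_true, y_pred):
    """Accuracy and false positive rate; label 1 = malicious, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = correct / len(pairs) if pairs else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


SAMPLE = (
    "From: Alice <alice@example.com>\n"
    "From: Admin <admin@bank.example>\n"
    "Reply-To: attacker@evil.example\n"
    "Subject: Verify your account\n"
    "\n"
    "Please verify your account at http://evil.example/login now.\n"
)
VOCAB = ["verify", "account", "invoice"]

features = combined_features(SAMPLE, VOCAB)
accuracy, fpr = accuracy_and_fpr([1, 1, 0, 0, 0], [1, 0, 0, 1, 0])
```

In a fuller pipeline, vectors like `features` would be computed for every sample and passed to a classifier such as a random forest, which appears in the record's keywords; the classifier itself is omitted here to keep the sketch dependency-free.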
Pages: 22