A combined feature selection approach for malicious email detection based on a comprehensive email dataset

Cited by: 0
Authors
Zhang, Han [1]
Shi, Yong [1]
Liu, Ming [1]
Chen, Libo [1]
Wu, Songyang [2]
Xue, Zhi [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Dongchuan Rd, Shanghai 200240, Peoples R China
[2] Third Res Inst Minist Publ Secur, Yueyang Rd, Shanghai 200031, Peoples R China
Source
CYBERSECURITY | 2025 / Vol. 8 / Issue 01
Keywords
Malicious email detection; Dataset; Machine learning; Random forest; SPAM;
DOI
10.1186/s42400-024-00309-6
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In recent years, new malicious email attacks have emerged. We summarize two major challenges in the current field of malicious email detection using machine learning algorithms. (1) Current works on malicious email detection use different datasets, and the field lacks a unified, comprehensive open source dataset as a standard for evaluating detection performance. In addition, outdated data makes it difficult to detect new types of malicious email attacks. (2) There are limitations in feature selection and extraction. Relying only on static features or only on body textual features is not sufficient to detect both common phishing or spam emails and new malicious emails that exploit protocol vulnerabilities. To address these problems, we propose the Exploiting Protocol Vulnerability Malicious Email (EPVME) dataset, which contains 49,136 malicious email samples. The EPVME dataset is constructed by summarizing and simulating the novel types of malicious email attacks that exploit email protocol vulnerabilities. In our dataset, both the coverage of malicious email types and the number of samples are significantly increased. By collecting the currently available open source datasets, we build a large-scale dataset with 660,985 samples. Through two sets of comparative experiments on the dataset containing EPVME, we verify the necessity, reliability, and validity of the EPVME dataset. By providing a large and comprehensive open source email dataset, we hope to help subsequent work on malicious email detection report comparable performance results. Furthermore, we propose a new feature selection and construction method that combines static features and textual features. We extract 79 static features from both the header and body parts of email samples, perform textual feature extraction on the pre-processed body parts, and combine various machine learning algorithms for detection model construction and experimental comparison. Our detection model achieves an accuracy of 99.968% and a false positive rate of 0.099%.
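The idea of combining static header/body features with textual body features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the 79 static features, so the five features below, the function names (`static_features`, `text_features`, `combined_vector`, `accuracy_and_fpr`), the sample message, and the tiny vocabulary are all hypothetical choices made for illustration. The last helper shows how accuracy and false positive rate are conventionally computed from confusion-matrix counts.

```python
import re
from collections import Counter
from email import message_from_string


def static_features(raw_email):
    """Extract a few illustrative static features (hypothetical, not the paper's 79)."""
    msg = message_from_string(raw_email)
    # Simplification: only single-part messages are handled in this sketch.
    body = msg.get_payload() if not msg.is_multipart() else ""
    return {
        # More than one From header can indicate a spoofing attempt.
        "num_from_headers": len(msg.get_all("From", [])),
        "num_received_headers": len(msg.get_all("Received", [])),
        "has_reply_to": int(msg.get("Reply-To") is not None),
        "num_urls_in_body": len(re.findall(r"https?://", body)),
        "body_length": len(body),
    }


def text_features(body, vocabulary):
    """Bag-of-words counts of the body over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", body.lower()))
    return [counts[word] for word in vocabulary]


def combined_vector(raw_email, vocabulary):
    """Concatenate static features and textual features into one vector."""
    msg = message_from_string(raw_email)
    body = msg.get_payload() if not msg.is_multipart() else ""
    return list(static_features(raw_email).values()) + text_features(body, vocabulary)


def accuracy_and_fpr(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / all samples; FPR = FP / (FP + TN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)
    return accuracy, fpr


# Hypothetical sample with a duplicated From header.
SAMPLE = (
    "From: alice@example.com\n"
    "From: attacker@example.net\n"
    "Reply-To: attacker@example.net\n"
    "Subject: Urgent: verify your account\n"
    "\n"
    "Please verify your account at http://example.net/login now.\n"
)

print(combined_vector(SAMPLE, ["verify", "account", "urgent"]))
```

A vector built this way could then be passed to any classifier, for example a random forest, which appears among the record's keywords. The classifier choice and training setup are outside the scope of this sketch.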
Pages: 22