A combined feature selection approach for malicious email detection based on a comprehensive email dataset

Authors
Zhang, Han [1]
Shi, Yong [1]
Liu, Ming [1]
Chen, Libo [1]
Wu, Songyang [2]
Xue, Zhi [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Dongchuan Rd, Shanghai 200240, Peoples R China
[2] Third Res Inst Minist Publ Secur, Yueyang Rd, Shanghai 200031, Peoples R China
Source
CYBERSECURITY | 2025 / Volume 8 / Issue 01
Keywords
Malicious email detection; Dataset; Machine learning; Random forest; Spam
DOI
10.1186/s42400-024-00309-6
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, new malicious email attacks have emerged. We summarize two major challenges in current machine-learning-based malicious email detection. (1) Existing work uses different datasets, and the field lacks a unified, comprehensive open source benchmark dataset for evaluating detection performance. In addition, outdated data makes it difficult to detect new types of malicious email attacks. (2) Feature selection and extraction are limited. Relying only on static features or only on body textual features is insufficient to detect both common phishing or spam emails and new malicious emails that exploit protocol vulnerabilities. To address these problems, we propose the Exploiting Protocol Vulnerability Malicious Email (EPVME) dataset, which contains 49,136 malicious email samples. The EPVME dataset is constructed by summarizing and simulating novel types of malicious email attacks that exploit email protocol vulnerabilities. Our dataset significantly increases both the coverage of malicious email types and the number of samples. By collecting the currently available open source datasets, we build a large-scale dataset with 660,985 samples. Through two sets of comparative experiments on the dataset containing EPVME, we verify the necessity, reliability, and validity of the EPVME dataset. By providing a large and comprehensive open source email dataset, we hope to help subsequent work on malicious email detection report comparable performance results. Furthermore, we propose a new feature selection and construction method that combines static features and textual features. We extract 79 static features from the header and body of each email sample, perform textual feature extraction on the pre-processed body, and combine various machine learning algorithms for detection model construction and experimental comparison. Our detection model achieves an accuracy of 99.968% and a false positive rate of 0.099%.
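The feature construction described in the abstract (static features from header and body, concatenated with textual features from the pre-processed body) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the five static features, the tiny vocabulary, and all function names below are hypothetical stand-ins chosen for illustration, since the record does not list the paper's 79 static features or its text pipeline. The helper at the end uses the standard definitions of accuracy and false positive rate, the two metrics reported above.

```python
import re
from collections import Counter
from email import message_from_string
from email.utils import parseaddr

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z]+")


def body_text(msg):
    """Return the first text/plain payload of a parsed message."""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                return part.get_payload()
        return ""
    return msg.get_payload()


def static_features(msg, body):
    """Hypothetical static features (assumed, not the paper's 79)."""
    from_addr = parseaddr(msg.get("From", ""))[1].lower()
    reply_addr = parseaddr(msg.get("Reply-To", ""))[1].lower()
    return [
        len(msg.get_all("Received", [])),   # number of relay hops
        len(msg.get_all("From", [])),       # duplicated From headers
        int(bool(reply_addr) and reply_addr != from_addr),  # Reply-To mismatch
        len(URL_RE.findall(body)),          # URLs in the body
        len(body),                          # body length in characters
    ]


def text_features(body, vocabulary):
    """Bag-of-words counts over a fixed vocabulary, after stripping URLs."""
    cleaned = URL_RE.sub(" ", body.lower())
    counts = Counter(TOKEN_RE.findall(cleaned))
    return [counts[word] for word in vocabulary]


def build_feature_vector(raw_email, vocabulary):
    """Concatenate static and textual features into one vector."""
    msg = message_from_string(raw_email)
    body = body_text(msg)
    return static_features(msg, body) + text_features(body, vocabulary)


def accuracy_and_fpr(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / total; false positive rate = FP / (FP + TN)."""
    return (tp + tn) / (tp + tn + fp + fn), fp / (fp + tn)


SAMPLE = (
    "From: a@example.com\n"
    "Reply-To: b@example.net\n"
    "Subject: hi\n"
    "\n"
    "Click http://example.com/x now, click now\n"
)

if __name__ == "__main__":
    print(build_feature_vector(SAMPLE, ["click", "now", "free"]))
```

A combined vector of this kind could then be passed to any classifier, for example the random forest named in the keywords. The sketch stops at feature construction so that it depends only on the standard library.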
Pages: 22