Explainable AI model for PDFMal detection based on gradient boosting model

Cited by: 3
Authors
Elattar, Mona [1, 2]
Younes, Ahmed [2]
Gad, Ibrahim [1]
Elkabani, Islam [2, 3]
Affiliations
[1] Department of Computer Science, Faculty of Science, Tanta University, Tanta
[2] Department of Mathematics and Computer Science, Faculty of Science, Alexandria University, Alexandria
[3] Faculty of Computer Science and Engineering, Al Alamein International University, Alamein
Keywords
Explainable artificial intelligence (XAI); Malicious PDF; Malware detection; Tree-based ensemble models
DOI
10.1007/s00521-024-10314-y
Abstract
Portable Document Format (PDF) files are widely used for document exchange because of their versatility. However, PDFs are highly vulnerable to malware attacks, which pose significant security risks. Existing defense mechanisms often struggle to detect and mitigate these threats, highlighting the need for more robust solutions. This paper introduces a robust framework that uses advanced tree-based ensemble models to detect malicious PDFs using the Evasive-PDFMal2022 dataset. The proposed model achieves a recall of 100%, an accuracy of 99.95%, and an inference time of 0.1723 s. The framework also exhibits minimal false positive and false negative rates, so it distinguishes malicious from benign PDFs with high precision. Shapley additive explanations (SHAP) are used to improve the interpretability and reliability of the model's predictions. The results highlight the effectiveness of the proposed model in improving PDF document security and addressing the challenges posed by malware attacks. © The Author(s) 2024.
Pages: 21607-21622
Number of pages: 15