Detecting Web-Based Attacks with SHAP and Tree Ensemble Machine Learning Methods

Cited by: 4
Authors
Ndichu, Samuel [1, 2]
Kim, Sangwook [1]
Ozawa, Seiichi [1, 3]
Ban, Tao [2]
Takahashi, Takeshi [2]
Inoue, Daisuke [2]
Affiliations
[1] Kobe Univ, Grad Sch Engn, Kobe, Hyogo 6578501, Japan
[2] Natl Inst Informat & Commun Technol, Tokyo 1848795, Japan
[3] Kobe Univ, Ctr Math & Data Sci, Kobe, Hyogo 6578501, Japan
Source
APPLIED SCIENCES-BASEL | 2022 / Vol. 12 / Issue 01
Keywords
web-based attacks; feature selection; Shapley additive explanations; tree ensemble methods; machine learning; CLICK FRAUD
DOI
10.3390/app12010060
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Attacks using Uniform Resource Locators (URLs) and their JavaScript (JS) code content to perpetrate malicious activities on the Internet are rampant and continuously evolving. Methods such as blocklisting, client honeypots, domain reputation inspection, and heuristic and signature-based systems are used to detect these malicious activities. Recently, machine learning approaches have been proposed; however, challenges still exist. First, blocklist systems are easily evaded by new URLs and JS code content, obfuscation, fast-flux, cloaking, and URL shortening. Second, heuristic and signature-based systems do not generalize well to zero-day attacks. Third, the Domain Name System allows cybercriminals to easily migrate their malicious servers to hide their Internet protocol addresses behind domain names. Finally, crafting fully representative features is challenging, even for domain experts. This study proposes a feature selection and classification approach for malicious JS code content using Shapley additive explanations and tree ensemble methods. The JS code features are obtained from the Abstract Syntax Tree form of the JS code, sample JS attack codes, and association rule mining. The malicious and benign JS code datasets obtained from Hynek Petrak and the Majestic Million Service were used for performance evaluation. We compared the performance of the proposed method to those of other feature selection methods in the task of malicious JS code content detection. In our experiments, the proposed approach achieved a recall of 0.9989, indicating that it is a better prediction model than the compared feature selection methods.
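The idea of selecting features by their Shapley contributions to a tree-based model can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the feature names, the two hand-written decision rules standing in for a tree ensemble, the all-zero baseline, and the sample values are all hypothetical. The sketch computes exact Shapley values by enumerating feature subsets, which is feasible only for a handful of features, and then ranks features by their mean absolute Shapley value.

```python
from itertools import combinations
from math import factorial

# Hypothetical JS code features (illustrative only, not taken from the paper).
FEATURES = ["eval_calls", "unescape_calls", "comment_lines"]


def tree_a(z):
    # Toy decision rule: many eval() calls look suspicious.
    return 1.0 if z[0] > 2 else 0.0


def tree_b(z):
    # Toy decision rule: eval() combined with unescape() looks suspicious.
    return 1.0 if (z[0] > 2 and z[1] > 0) else 0.0


def predict(z):
    # Stand-in for a tree ensemble: average the outputs of the two rules.
    return (tree_a(z) + tree_b(z)) / 2


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their value from x; the rest use the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def select_top_k(samples, baseline, k):
    """Rank features by mean absolute Shapley value and keep the top k."""
    totals = [0.0] * len(FEATURES)
    for x in samples:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    means = [t / len(samples) for t in totals]
    ranked = sorted(range(len(FEATURES)), key=lambda i: means[i], reverse=True)
    return [FEATURES[i] for i in ranked[:k]]


# Hypothetical samples: one row of feature values per script.
X = [[5, 1, 0], [0, 0, 3], [4, 0, 1], [1, 2, 7]]
BASELINE = [0, 0, 0]

if __name__ == "__main__":
    print(select_top_k(X, BASELINE, 2))
```

In this toy setup, the feature that neither rule reads (`comment_lines`) receives a Shapley value of zero for every sample, so it falls to the bottom of the ranking. For each sample, the Shapley values sum to the difference between the model output for that sample and the output for the baseline.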
Pages: 20