Improving malicious URLs detection via feature engineering: Linear and nonlinear space transformation methods

Cited by: 112

Authors:

Li, Tie [1]

Kou, Gang [2]

Peng, Yi [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu 611731, Peoples R China

[2] Southwestern Univ Finance & Econ, Sch Business Adm, Chengdu 610074, Peoples R China

Source:

INFORMATION SYSTEMS | 2020 / Vol. 91

Funding:

National Natural Science Foundation of China; China Postdoctoral Science Foundation;

Keywords:

Feature engineering; Malicious URLs detection; Nystrom method; Distance metric learning; Singular value decomposition; NEAREST-NEIGHBOR; WEBSITES; SELECTION;

DOI:

10.1016/j.is.2020.101494

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

In malicious URL detection, traditional classifiers are challenged because the data volume is huge, patterns change over time, and the correlations among features are complicated. Feature engineering plays an important role in addressing these problems. To better represent the underlying problem and improve the performance of classifiers in identifying malicious URLs, this paper proposes a combination of linear and nonlinear space transformation methods. For the linear transformation, a two-stage distance metric learning approach was developed: first, singular value decomposition was performed to obtain an orthogonal space, and then a linear program was solved to obtain an optimal distance metric. For the nonlinear transformation, we introduced the Nyström method for kernel approximation and used the revised distance metric in its radial basis function, so that the merits of both linear and nonlinear transformations can be utilized. A total of 331,622 URLs with 62 features were collected to validate the proposed feature engineering methods. The results showed that the proposed methods significantly improved the efficiency and performance of certain classifiers, such as k-Nearest Neighbor, Support Vector Machine, and neural networks. The malicious URL identification rate of k-Nearest Neighbor increased from 68% to 86%, that of linear Support Vector Machine from 58% to 81%, and that of Multi-Layer Perceptron from 63% to 82%. We also developed a website to demonstrate a malicious URL detection system that uses the methods proposed in this paper. The system can be accessed at: http://url.jspfans.com. (C) 2020 The Author(s). Published by Elsevier Ltd.
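
The pipeline outlined in the abstract (SVD rotation, a learned distance metric, then a Nyström approximation of an RBF kernel that uses that metric) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation. The function names, the random stand-in data, the uniform landmark sampling, and the value of `gamma` are all assumptions. The linear-programming step is not reproduced here: the diagonal weight vector `w` is a hypothetical placeholder for whatever metric that step would return.

```python
import numpy as np

def svd_rotate(X, n_components):
    """Center X and project it onto its top right-singular vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:n_components].T
    return (X - mean) @ V, mean, V

def weighted_rbf(A, B, w, gamma):
    """RBF kernel exp(-gamma * sum_j w_j * (a_j - b_j)**2), diagonal metric w."""
    Aw = A * np.sqrt(w)
    Bw = B * np.sqrt(w)
    sq = (Aw ** 2).sum(1)[:, None] + (Bw ** 2).sum(1)[None, :] - 2 * Aw @ Bw.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystroem_features(Z, landmarks, w, gamma):
    """Nystrom feature map: K(Z, L) @ K(L, L)^(-1/2)."""
    K_mm = weighted_rbf(landmarks, landmarks, w, gamma)
    K_nm = weighted_rbf(Z, landmarks, w, gamma)
    vals, vecs = np.linalg.eigh(K_mm)
    vals = np.maximum(vals, 1e-12)          # guard against tiny eigenvalues
    inv_sqrt = (vecs * vals ** -0.5) @ vecs.T
    return K_nm @ inv_sqrt

# Random stand-in data (assumption): 200 samples, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# Stage 1: orthogonal space via SVD.
Z, mean, V = svd_rotate(X, n_components=4)

# Stage 2 placeholder (assumption): weights an LP solver would supply.
w = np.array([2.0, 1.0, 0.5, 0.25])

# Nonlinear stage: Nystrom features from 20 uniformly sampled landmarks.
idx = rng.choice(len(Z), size=20, replace=False)
Phi = nystroem_features(Z, Z[idx], w, gamma=0.5)
print(Phi.shape)  # (200, 20)
```

The resulting `Phi` can be passed to a linear classifier, so that inner products between its rows approximate the metric-aware RBF kernel without forming the full kernel matrix.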


Pages: 18

