Improving malicious URLs detection via feature engineering: Linear and nonlinear space transformation methods

Cited by: 112
Authors
Li, Tie [1]
Kou, Gang [2]
Peng, Yi [1]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Management & Econ, Chengdu 611731, Peoples R China
[2] Southwestern Univ Finance & Econ, Sch Business Adm, Chengdu 610074, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Feature engineering; Malicious URLs detection; Nystrom method; Distance metric learning; Singular value decomposition; NEAREST-NEIGHBOR; WEBSITES; SELECTION;
DOI
10.1016/j.is.2020.101494
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In malicious URLs detection, traditional classifiers are challenged because the data volume is huge, patterns change over time, and the correlations among features are complicated. Feature engineering plays an important role in addressing these problems. To better represent the underlying problem and improve the performance of classifiers in identifying malicious URLs, this paper proposes a combination of linear and nonlinear space transformation methods. For the linear transformation, a two-stage distance metric learning approach was developed: first, singular value decomposition was performed to obtain an orthogonal space, and then a linear program was solved to obtain an optimal distance metric. For the nonlinear transformation, the Nyström method was introduced for kernel approximation, and the revised distance metric was used in its radial basis function so that the merits of both linear and nonlinear transformations can be utilized. A total of 331,622 URLs with 62 features were collected to validate the proposed feature engineering methods. The results showed that the proposed methods significantly improved the efficiency and performance of certain classifiers, such as k-Nearest Neighbor, Support Vector Machine, and neural networks. The malicious URL identification rate of k-Nearest Neighbor increased from 68% to 86%, the rate of linear Support Vector Machine increased from 58% to 81%, and the rate of Multi-Layer Perceptron increased from 63% to 82%. The authors also developed a website to demonstrate a malicious URLs detection system that uses the methods proposed in this paper. The system can be accessed at: http://url.jspfans.com. © 2020 The Author(s). Published by Elsevier Ltd.
Pages: 18
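The two transformations named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the `gamma` parameter, and the diagonal weight vector `w` are assumptions made for illustration. In particular, the paper learns its distance metric by linear programming; here `w` is simply a placeholder supplied by the caller.

```python
import numpy as np


def svd_rotate(X):
    """Center X and rotate it onto its right singular vectors (orthogonal space)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T


def weighted_rbf(A, B, w, gamma=1.0):
    """RBF kernel using the weighted squared distance sum_j w_j * (a_j - b_j)^2."""
    diff = A[:, None, :] - B[None, :, :]
    d2 = np.einsum("ijk,k->ij", diff ** 2, w)
    return np.exp(-gamma * d2)


def nystroem_features(Z, landmarks, w, gamma=1.0, eps=1e-10):
    """Nystrom feature map: Phi = K_nm U diag(lambda)^(-1/2), so Phi Phi^T ~ K."""
    K_mm = weighted_rbf(landmarks, landmarks, w, gamma)
    K_nm = weighted_rbf(Z, landmarks, w, gamma)
    evals, evecs = np.linalg.eigh(K_mm)
    keep = evals > eps  # drop near-zero eigenvalues for numerical stability
    inv_sqrt = evecs[:, keep] / np.sqrt(evals[keep])
    return K_nm @ inv_sqrt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))       # synthetic stand-in for URL feature vectors
    Z = svd_rotate(X)
    w = np.ones(Z.shape[1])             # placeholder weights, NOT a learned metric
    landmarks = Z[rng.choice(len(Z), size=20, replace=False)]
    Phi = nystroem_features(Z, landmarks, w, gamma=0.1)
    print(Phi.shape)                    # at most 20 columns, one per kept eigenvalue
```

The resulting `Phi` could be passed to a linear classifier in place of the raw features. Using a small number of landmarks keeps the cost linear in the number of samples, which is the usual motivation for the Nyström approximation on large data sets.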