Online URL Classification for Large-Scale Streaming Environments

Cited by: 5
Authors
Singh, Neetu [1]
Chaudhari, Narendra S. [2 ]
Singh, Nidhi [3 ]
Affiliations
[1] Cent Univ Himachal Pradesh, Dept Comp Sci, Shahpur, Himachal Pradesh, India
[2] Visvesvaraya Natl Inst Technol, Nagpur, Maharashtra, India
[3] Intel Secur, Duisburg, Germany
Keywords
applications; artificial intelligence; big data; computing methodologies; expert systems; intelligent systems; pattern recognition
DOI
10.1109/MIS.2017.39
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Large-scale streaming URLs are the norm in many commercial software products that aim to filter URLs based on their sensitivity or risk level. In such problem scenarios, filtering is typically done by classifying a URL using either its webpage content or certain additional contextual information. However, such approaches are slow and computationally expensive, as they require gathering and processing webpage content or other contextual information for each URL. In this work, the authors propose a method for classifying URLs in large-scale streaming environments that doesn't suffer from these drawbacks. The proposed method is based on online ensemble learning, which results in lightweight prediction models that are well-suited for classification of streaming datasets. The authors illustrate the effectiveness of the proposed approach using large-scale datasets from a live, production environment and show that the proposed method results in an increase of 5 to 8 percent in terms of precision and 3 to 5.5 percent in terms of recall. © 2017 IEEE.
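The abstract names online ensemble learning over URL-only input but does not give the features, base learners, or combination rule. The following is therefore only a minimal sketch of the general idea, not the authors' method: it assumes hashed lexical features taken from the URL string alone, perceptrons as base learners, and a weighted-majority vote whose weights shrink when a learner errs. All names (`token_features`, `trigram_features`, `Perceptron`, `OnlineEnsemble`, `DIM`, `beta`) are hypothetical.

```python
import re
import zlib

DIM = 2 ** 16  # assumed size of the hashed feature space


def token_features(url):
    """Hash alphanumeric URL tokens into a set of feature indices."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    return {zlib.crc32(t.encode()) % DIM for t in tokens}


def trigram_features(url):
    """Hash character trigrams of the lowercased URL into feature indices."""
    s = url.lower()
    return {zlib.crc32(s[i:i + 3].encode()) % DIM for i in range(len(s) - 2)}


class Perceptron:
    """Online linear learner with sparse weights; updates only on mistakes."""

    def __init__(self, extract, lr=1.0):
        self.extract = extract  # function mapping a URL to feature indices
        self.lr = lr
        self.w = {}
        self.b = 0.0

    def predict(self, url):
        feats = self.extract(url)
        score = self.b + sum(self.w.get(i, 0.0) for i in feats)
        return 1 if score >= 0 else -1

    def update(self, url, y):
        if self.predict(url) != y:
            for i in self.extract(url):
                self.w[i] = self.w.get(i, 0.0) + self.lr * y
            self.b += self.lr * y


class OnlineEnsemble:
    """Weighted-majority vote; a learner's vote is scaled by beta on each mistake."""

    def __init__(self, learners, beta=0.5):
        self.learners = learners
        self.votes = [1.0] * len(learners)
        self.beta = beta

    def predict(self, url):
        total = sum(v * m.predict(url) for v, m in zip(self.votes, self.learners))
        return 1 if total >= 0 else -1

    def learn(self, url, y):
        """Process one labelled URL from the stream (y is +1 or -1)."""
        for k, m in enumerate(self.learners):
            if m.predict(url) != y:
                self.votes[k] *= self.beta
            m.update(url, y)
        top = max(self.votes)
        self.votes = [v / top for v in self.votes]  # keep weights from underflowing
```

In use, each arriving labelled URL is passed once to `learn`, and `predict` scores unlabelled URLs from the string alone, so no webpage content has to be fetched. The per-URL cost is proportional to the number of extracted features, which is the property the abstract refers to as lightweight.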
Pages: 31-36
Page count: 6