Feature evaluation for web crawler detection with data mining techniques

Cited by: 59
Authors
Stevanovic, Dusan [1]
An, Aijun [1]
Vlajic, Natalija [1]
Affiliation
[1] York Univ, Dept Comp Sci & Engn, Toronto, ON M3J 1P3, Canada
Keywords
Web crawler detection; Web server access logs; Data mining; Classification; DDoS; WEKA
DOI
10.1016/j.eswa.2012.01.210
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Distributed Denial of Service (DDoS) is one of the most damaging attacks on Internet security today. Recently, malicious web crawlers have been used to execute automated DDoS attacks on web sites across the WWW. In this study, we examine the effect of applying seven well-established data mining classification algorithms to static web server access logs in order to: (1) classify user sessions as belonging to either automated web crawlers or human visitors and (2) identify which of the automated web crawler sessions exhibit 'malicious' behavior and are potentially participants in a DDoS attack. The classification performance is evaluated in terms of classification accuracy, recall, precision and F1 score. Seven of the nine vector (i.e. web-session) features employed in our work are borrowed from earlier studies on classifying user sessions as belonging to web crawlers. However, we also introduce two novel web-session features: the consecutive sequential request ratio and the standard deviation of page request depth. The effectiveness of the new features is evaluated in terms of the information gain and gain ratio metrics. The experimental results demonstrate the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions. (c) 2012 Elsevier Ltd. All rights reserved.
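The information gain and gain ratio metrics named in the abstract have standard textbook definitions, which the following minimal Python sketch illustrates. It is not the authors' code or the WEKA implementation: the function names, the toy session labels, and the assumption that each feature has already been discretized into bins (here "high" and "low") are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature bin.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # A constant feature carries no information.
    return information_gain(feature_values, labels) / split_info


# Hypothetical toy data: four sessions, each labeled crawler or human.
labels = ["crawler", "crawler", "human", "human"]
informative = ["high", "high", "low", "low"]    # bin perfectly separates classes
uninformative = ["high", "low", "high", "low"]  # bin is independent of class

print(information_gain(informative, labels), gain_ratio(informative, labels))
print(information_gain(uninformative, labels), gain_ratio(uninformative, labels))
```

A feature whose bins separate the two classes perfectly receives the maximum score under both metrics, while a feature that is independent of the class scores zero. Gain ratio additionally penalizes features that split the data into many small bins.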
Pages: 8707-8717
Page count: 11