Content-aware web robot detection

Cited by: 8
Authors
Lagopoulos, Athanasios [1]
Tsoumakas, Grigorios [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Thessaloniki, Greece
Keywords
Web robot; Crawler; Semantics; Supervised learning; Latent Dirichlet allocation
DOI
10.1007/s10489-020-01754-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawlers account for more than a third of total web traffic, and they threaten the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing, and publishing, as well as websites with rich and unique content, are the most affected by their actions. To deal with this problem, we present a novel web robot detection approach that takes advantage of the content of a website, based on the assumption that human web users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical user session representation of log-based features with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, towards alleviating the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outranks state-of-the-art methods for web robot detection.
Pages: 4017-4028
Page count: 12
Related papers
32 in total
[31] Zabihi M, 2014, 2014 4th International Conference on Computer and Knowledge Engineering (ICCKE), p. 23, DOI 10.1109/ICCKE.2014.6993362
[32] Zabihimayvan, Mahdieh; Sadeghi, Reza; Rude, H. Nathan; Doran, Derek. A soft computing approach for benign and malicious web robot detection [J]. Expert Systems with Applications, 2017, 87: 129-140