Content-aware web robot detection

Cited by: 7
Authors
Lagopoulos, Athanasios [1]
Tsoumakas, Grigorios [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Thessaloniki, Greece
Keywords
Web robot; Crawler; Semantics; Supervised learning; Latent Dirichlet allocation
DOI
10.1007/s10489-020-01754-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawlers account for more than a third of total web traffic, and they threaten the security, privacy and veracity of web applications and their users. Businesses in finance, ticketing, and publishing, as well as websites with rich and unique content, are the most affected by their actions. To deal with this problem, we present a novel web robot detection approach that takes advantage of the content of a website, based on the assumption that human web users are interested in specific topics, while web robots crawl the web randomly. Our approach extends the typical user session representation of log-based features with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, towards alleviating the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outranks state-of-the-art methods for web robot detection.
Pages: 4017-4028
Number of pages: 12