Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

Cited by: 4
Authors
Dang, Thi Kim Nhung [1]
Bucur, Doina [1]
Atil, Berk [2]
Pitel, Guillaume [3, 4]
Ruis, Frank [1]
Kadkhodaei, Hamidreza [1]
Litvak, Nelly [1, 5]
Affiliations
[1] Univ Twente, Enschede, Netherlands
[2] Bogazici Univ, Istanbul, Turkiye
[3] Babbar, Paris, France
[4] Exensa, Montrouge, France
[5] Eindhoven Univ Technol, Eindhoven, Netherlands
Keywords
Web change prediction; Focused crawling; Web mining; Statistical models; Probabilistic regression; Web search engines; LARGE-SCALE; PATTERNS
DOI
10.1016/j.knosys.2022.110126
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to the pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modeling. All experiments were carried out on an original dataset, made available by a commercial focused crawler. © 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
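
The idea in the abstract (predict a Poisson rate of new outlinks from a page's own recent history and that of its content-related pages, then rank pages by a score) can be illustrated with a minimal sketch. This is not the authors' model: the paper trains learners such as NGBoost, whereas the code below simply averages recent counts. The function names, the `weight_own` parameter, and the toy data are all hypothetical.

```python
import math
from statistics import mean


def lbla_rate(own_history, neighbour_histories, weight_own=0.5):
    """Hypothetical Poisson rate estimate for new outlinks on one page.

    own_history: recent new-outlink counts of the page itself ("look back").
    neighbour_histories: recent counts of content-related pages ("look around").
    weight_own: assumed mixing weight, not taken from the paper.
    """
    look_back = mean(own_history) if own_history else 0.0
    neighbour_means = [mean(h) for h in neighbour_histories if h]
    # With no neighbour data, fall back to the page's own history.
    look_around = mean(neighbour_means) if neighbour_means else look_back
    return weight_own * look_back + (1.0 - weight_own) * look_around


def prob_any_new_outlink(rate):
    """P(N >= 1) when N follows a Poisson distribution with the given rate."""
    return 1.0 - math.exp(-rate)


def rank_pages(pages):
    """Return page ids ordered by estimated rate, highest first."""
    rates = {pid: lbla_rate(own, around) for pid, (own, around) in pages.items()}
    return sorted(rates, key=rates.get, reverse=True)


# Toy data: page id -> (own recent counts, counts of related pages).
pages = {
    "a": ([0, 0, 1], [[0, 0, 0]]),
    "b": ([3, 2, 4], [[1, 2, 3], [2, 2, 2]]),
    "c": ([], [[1, 1, 1]]),
}
ranking = rank_pages(pages)
```

Here the expected count serves as the scoring function; `prob_any_new_outlink` shows an alternative score that saturates for pages with high rates. Which score guides a crawler best is an empirical question that the paper addresses with its own scoring functions.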
Pages: 16