Effective page refresh policies for Web crawlers

被引:124
作者
Cho, J
Garcia-Molina, H
机构
[1] Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90095 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2003年 / 28卷 / 04期
关键词
algorithms; design; performance; Web crawlers; world-wide web; web search engines; page refresh;
D O I
10.1145/958942.958945
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
In this article, we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to maintain the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we got orders of magnitude improvement from existing policies.
引用
收藏
页码:390 / 426
页数:37
相关论文
共 38 条
  • [1] DATA CACHING ISSUES IN AN INFORMATION-RETRIEVAL SYSTEM
    ALONSO, R
    BARBARA, D
    GARCIAMOLINA, H
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 1990, 15 (03): : 359 - 384
  • [2] [Anonymous], P ACM SIGCHI C HUM F
  • [3] BARBARA D, 1995, P INT C EXT DAT TECH, P373
  • [4] AN ALGORITHM FOR CONCURRENCY-CONTROL AND RECOVERY IN REPLICATED DISTRIBUTED DATABASES
    BERNSTEIN, PA
    GOODMAN, N
    [J]. ACM TRANSACTIONS ON DATABASE SYSTEMS, 1984, 9 (04): : 596 - 615
  • [5] BERNSTEIN PA, 1980, P 6 INT C VER LARG D, P126
  • [6] Brewington B.E., 2000, P 9 INT WORLD WID WE
  • [7] Keeping up with the changing Web
    Brewington, BE
    Cybenko, G
    [J]. COMPUTER, 2000, 33 (05) : 52 - +
  • [8] CHAKRABARTI S, 1999, P 8 INT WORLD WID WE
  • [9] CHO J, 2003, ACM T INTERNET TECH, V3
  • [10] CHO J, 1998, P 7 INT WORLD WID WE