Efficient crawling through URL ordering

Cited by: 209
Authors
Cho, J [1]
Garcia-Molina, H [1]
Page, L [1]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Source
COMPUTER NETWORKS AND ISDN SYSTEMS | 1998 / Vol. 30 / Issue 1-7
Keywords
crawling; URL ordering
DOI
10.1016/S0169-7552(98)00108-1
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without. (C) 1998 Published by Elsevier Science B.V. All rights reserved.
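As an illustration of what "URL ordering" can mean in practice, here is a minimal sketch of a crawl frontier kept in a priority queue. The abstract does not name the importance metrics the paper defines, so this sketch assumes one for illustration only: the number of links pointing to a URL from pages visited so far. The function `crawl_order`, the in-memory link graph `toy_web`, and the `budget` parameter are all hypothetical; a real crawler would fetch and parse pages instead of reading a dictionary.

```python
import heapq
from collections import defaultdict


def crawl_order(graph, seed, budget):
    """Visit up to `budget` pages, always taking next the seen URL with the
    most in-links discovered so far (an assumed importance metric)."""
    inlinks = defaultdict(int)   # in-links seen so far for each URL
    done = set()                 # URLs already visited
    visited = []                 # visit order, returned to the caller
    heap = [(0, 0, seed)]        # entries are (-inlinks, insertion order, url)
    counter = 1                  # tie-breaker: earlier discoveries go first

    while heap and len(visited) < budget:
        neg_count, _, url = heapq.heappop(heap)
        # Skip visited URLs and stale entries whose count is out of date.
        if url in done or -neg_count != inlinks[url]:
            continue
        done.add(url)
        visited.append(url)
        for target in graph.get(url, []):
            if target in done:
                continue
            inlinks[target] += 1
            # Push a fresh entry; older entries for `target` become stale.
            heapq.heappush(heap, (-inlinks[target], counter, target))
            counter += 1
    return visited


# Hypothetical five-page site: "d" is linked from both "b" and "c".
toy_web = {"a": ["b", "c"], "b": ["e", "d"], "c": ["d"], "d": [], "e": []}
print(crawl_order(toy_web, "a", budget=4))
```

In this toy graph a breadth-first crawler would reach "e" before "d", whereas the sketch visits "d" first because two visited pages link to it. With a budget of four pages, "d" is therefore fetched and "e" is not.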
Pages: 161-172
Page count: 12