Paradoxical Effects in PageRank Incremental Computations

被引:8
作者
Boldi, Paolo [1 ]
Santini, Massimo [1 ]
Vigna, Sebastiano [1 ]
机构
[1] Univ Milan, Dipartimento Sci Informaz, Via Comel 39-41, I-20135 Milan, Italy
关键词
D O I
10.1080/15427951.2005.10129106
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Deciding which kind of visiting strategy accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. This paper proposes a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields node orders that agree with the ones computed in the complete graph; orders are compared using Kendall's tau. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.
引用
收藏
页码:387 / 404
页数:18
相关论文
共 29 条
[1]  
Abiteboul S., 2003, P 12 INT C WORLD WID, P280
[2]   Internet -: Diameter of the World-Wide Web [J].
Albert, R ;
Jeong, H ;
Barabási, AL .
NATURE, 1999, 401 (6749) :130-131
[3]   Approximate indexed lists [J].
Andersson, A ;
Petersson, O .
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC, 1998, 29 (02) :256-276
[4]  
[Anonymous], 1980, RANK CORRELATION MET
[5]  
Bianchini M., 2005, ACM Transactions on Internet Technology, V5, P92, DOI 10.1145/1052934.1052938
[6]   UbiCrawler: a scalable fully distributed Web crawler [J].
Boldi, P ;
Codenotti, B ;
Santini, M ;
Vigna, S .
SOFTWARE-PRACTICE & EXPERIENCE, 2004, 34 (08) :711-726
[7]   Codes for the World Wide Web [J].
Boldi, Paolo ;
Vigna, Sebastiano .
INTERNET MATHEMATICS, 2005, 2 (04) :407-429
[8]  
Boldi Paolo, 2004, P 13 INT C ONWORLD W, P595
[9]   The anatomy of a large-scale hypertextual Web search engine [J].
Brin, S ;
Page, L .
COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) :107-117
[10]   Scheduling algorithms for Web crawling [J].
Castillo, C ;
Marin, M ;
Rodriguez, A ;
Baeza-Yates, R .
WEBMEDIA & LA-WEB 2004, VOL 1, PROCEEDINGS, 2004, :10-17