Novel approaches to crawling important pages early

Cited by: 9
Authors
Alam, Md. Hijbul [1]
Ha, JongWoo [1]
Lee, SangKeun [1]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 136701, South Korea
Funding
National Research Foundation of Singapore
Keywords
Web crawler; Crawl ordering; PageRank; Fractional PageRank
DOI
10.1007/s10115-012-0535-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving effectiveness by 5% in cumulative PageRank.
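The abstract evaluates crawl orderings by cumulative PageRank, that is, how much of the total PageRank mass has been collected after each page is fetched. The sketch below illustrates only that metric on a toy graph. It is not the authors' FPR algorithm or the RankMass crawler; the graph `TOY_GRAPH`, the function names, the damping factor of 0.85, and the iteration count are all assumptions made for illustration. The "best first" ordering shown is an oracle that sees the full graph, which a real crawler cannot do.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # page -> list of pages it links to


def pagerank(graph: Graph, damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank; every link target must be a key of graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not graph[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if graph[u]:
                share = damping * rank[u] / len(graph[u])
                for v in graph[u]:
                    new_rank[v] += share
        rank = new_rank
    return rank


def cumulative_pagerank(order: List[str], rank: Dict[str, float]) -> List[float]:
    """Running total of PageRank collected after each page in the crawl order."""
    total = 0.0
    curve = []
    for page in order:
        total += rank[page]
        curve.append(total)
    return curve


# Hypothetical four-page Web graph, used only for this illustration.
TOY_GRAPH: Graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

ranks = pagerank(TOY_GRAPH)
# Oracle ordering: fetch pages in decreasing order of their true PageRank.
best_first = sorted(TOY_GRAPH, key=ranks.get, reverse=True)
curve = cumulative_pagerank(best_first, ranks)
```

An ordering whose curve rises faster collects important pages earlier, which is the sense in which one crawl ordering can be compared with another under this metric.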
Pages: 707-734
Page count: 28