WAN-based distributed web crawling

Cited by: 4
Authors
Xu X. [1]
Zhang W.-Z. [1]
Zhang H.-L. [1]
Fang B.-X. [1]
Affiliation
[1] School of Computer Science and Technology, Harbin Institute of Technology
Source
Ruan Jian Xue Bao/Journal of Software | 2010, Vol. 21, No. 5
Keywords
Agent collaboration; Agent deployment; Search engine; WAN-based distributed crawling; Web partition
DOI
10.3724/SP.J.1001.2010.03725
Abstract
Three core issues are recognized for WAN-based distributed Web crawling systems: Web partition, agent collaboration, and agent deployment. Centering on these issues, this paper presents a comprehensive overview of the strategies currently adopted by the academic and business communities. The experiences, problems, and challenges encountered by WAN-based distributed Web crawlers are classified and discussed in depth. A summary of the current evaluation indicators is also given. Finally, conclusions and some suggestions for future research are put forward. © Institute of Software, the Chinese Academy of Sciences. All rights reserved.
Pages: 1067-1082
Page count: 15