Web crawling

Cited by: 153
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
Source
Foundations and Trends in Information Retrieval | 2010, Vol. 4, No. 3
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175-246
Number of pages: 71