Web crawling

Cited by: 153
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
Source
Foundations and Trends in Information Retrieval | 2010 / Vol. 4 / No. 3
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175-246
Page count: 71