Web crawling

Cited by: 153
Authors
Olston C. [1]
Najork M. [2]
Affiliations
[1] Yahoo Research, Sunnyvale, CA, 94089
[2] Microsoft Research, Mountain View, CA, 94043
Source
Foundations and Trends in Information Retrieval | 2010 / Vol. 4 / No. 3
DOI
10.1561/1500000017
Abstract
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work. © 2010 C. Olston and M. Najork.
Pages: 175-246
Page count: 71
Related papers
117 entries in total
[61] Gao W., Lee H.C., Miao Y., Geographically focused collaborative crawling, Proceedings of the 15th International World Wide Web Conference, (2006)
[62] GigaAlert
[63] Gomes D., Silva M.J., Characterizing a national community web, ACM Transactions on Internet Technology, 5, 3, pp. 508-531, (2005)
[64] Gravano L., Garcia-Molina H., Tomasic A., The effectiveness of GlOSS for the text database discovery problem, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, (1994)
[65] Gray M., Internet growth and statistics: Credits and background
[66] Gruhl D., Chavet L., Gibson D., Meyer J., Pattanayak P., Tomkins A., Zien J., How to build a WebFountain: An architecture for very large-scale text analytics, IBM Systems Journal, 43, 1, pp. 64-77, (2004)
[67] Gyongyi Z., Garcia-Molina H., Web Spam Taxonomy, Proceedings of the 1st International Workshop on Adversarial Information Retrieval, (2005)
[68] Hafri Y., Djeraba C., High performance crawling system, Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, (2004)
[69] Henzinger M., Heydon A., Mitzenmacher M., Najork M., Measuring index quality using random walks on the web, Proceedings of the 8th International World Wide Web Conference, (1999)
[70] Henzinger M., Heydon A., Mitzenmacher M., Najork M., On near-uniform URL sampling, Proceedings of the 9th International World Wide Web Conference, (2000)