Improving the freshness of the search engines by a probabilistic approach based incremental crawler

Cited by: 0
Authors
G. Pavai
T. V. Geetha
Affiliations
[1] CEG, Department of Computer Science and Engineering
[2] Anna University
Source
Information Systems Frontiers | 2017 / Vol. 19
Keywords
Deep web; Incremental crawl; Bayes theorem; Information retrieval; Set covering algorithm; Semantic weighted set covering algorithm
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
The web is flooded with data. The crawler is responsible for accessing web pages and passing them to the indexer so that they become available to search engine users, but the rate at which these pages change requires the crawler to employ refresh strategies so that users receive updated content. Furthermore, the deep web is the part of the web that holds abundant amounts of quality data (compared with the normal/surface web) but is not technically accessible to a search engine’s crawler. Existing deep web crawl methods help to access deep web data from the result pages generated by filling forms with a set of queries and accessing the web databases through them. However, these methods are unable to maintain the freshness of the local databases. Both the surface web and the deep web need an incremental crawl associated with the normal crawl architecture to overcome this problem. Crawling the deep web requires selecting an appropriate set of queries that covers almost all the records in the data source while keeping the overlap between retrieved records low, so that network utilization is reduced. An incremental crawl increases network utilization with every increment. Therefore, a reduced query set of this kind should be used to minimize network utilization.
Our contributions in this work are: the design of a probabilistic incremental crawler to handle the dynamic changes of surface web pages; an adaptation of this method, with a modification, to handle the dynamic changes in deep web databases; a new evaluation measure called the ‘Crawl-hit rate’, which evaluates the efficiency of the incremental crawler in terms of the number of times a crawl is actually necessary at the predicted time; and a semantic weighted set covering algorithm for reducing the queries so that the network cost falls for every increment of the crawl without any compromise in the number of records retrieved. The evaluation of the incremental crawler shows a good improvement in the freshness of the databases and a good Crawl-hit rate (83% for web pages and 81% for deep web databases) with lower overhead than the baseline.
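The query-reduction idea in the abstract can be illustrated with the textbook greedy heuristic for weighted set cover. This is only a sketch under stated assumptions, not the authors' semantic weighted set covering algorithm, whose weighting scheme the abstract does not describe. The function name `greedy_weighted_set_cover`, the sample queries, and the per-query `query_cost` values (standing in for network cost) are all hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_weighted_set_cover(
    query_records: Dict[str, Set[Hashable]],
    query_cost: Dict[str, float],
) -> List[str]:
    """Select queries until every record reachable by some query is covered.

    At each step, pick the query with the lowest cost per newly covered
    record (the standard greedy rule for weighted set cover).
    """
    # All records that at least one query can retrieve.
    uncovered: Set[Hashable] = set().union(*query_records.values())
    chosen: List[str] = []
    while uncovered:
        # Only queries that still add new records are candidates.
        candidates = [q for q in query_records if query_records[q] & uncovered]
        best = min(
            candidates,
            key=lambda q: query_cost[q] / len(query_records[q] & uncovered),
        )
        chosen.append(best)
        uncovered -= query_records[best]
    return chosen


# Hypothetical data: record ids returned by each query, and an assumed cost.
records = {"jazz": {1, 2, 3}, "blues": {3, 4}, "rock": {1, 2, 3, 4, 5}, "folk": {5}}
cost = {"jazz": 2.0, "blues": 2.0, "rock": 10.0, "folk": 1.0}
selected = greedy_weighted_set_cover(records, cost)
```

With these assumed costs the sketch covers all five records using three cheap queries rather than the single expensive one, which mirrors the stated goal of lowering network cost without losing records.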
Pages: 1013-1028
Page count: 15