Efficient watcher based web crawler design

Cited by: 5
Authors
Alqaraleh, Saed [1]
Ramadan, Omar [1]
Salamah, Muhammed [1]
Affiliations
[1] Eastern Mediterranean Univ, Famagusta, Turkey
Keywords
Information retrieval; Search engine; AJAX crawler; Crawler re-visiting policies; Crawling algorithm; Static crawler; AJAX;
DOI
10.1108/AJIM-02-2015-0019
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose - The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl both static and dynamic web sites and download only the updated and newly added web pages. Design/methodology/approach - In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, where each unit is responsible for performing a specific crawling process. Findings - Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites as compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawler overlapping and communication problems. Originality/value - The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without any explicit human intervention and without downloading the entire web sites.
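The watcher report described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the watcher file detects changes, so the sketch assumes it compares two snapshots of last-modified timestamps taken on the server. The function names `build_watcher_report` and `urls_to_download`, the snapshot layout, and the report keys are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def build_watcher_report(previous: Dict[str, float],
                         current: Dict[str, float]) -> Dict[str, List[str]]:
    """Compare two {page address: last-modified timestamp} snapshots of one site.

    Assumption: change detection by timestamp comparison; the paper's
    watcher file may work differently.
    """
    # Pages present now but absent from the previous snapshot are newly added.
    added = sorted(url for url in current if url not in previous)
    # Pages present in both snapshots with a newer timestamp are updated.
    updated = sorted(url for url in current
                     if url in previous and current[url] > previous[url])
    return {"added": added, "updated": updated}


def urls_to_download(report: Dict[str, List[str]]) -> List[str]:
    """Return only the addresses a crawler needs to fetch, per the report."""
    return report["added"] + report["updated"]


if __name__ == "__main__":
    previous = {"/index.html": 100.0, "/news.html": 200.0}
    current = {"/index.html": 100.0, "/news.html": 250.0, "/about.html": 300.0}
    report = build_watcher_report(previous, current)
    print(urls_to_download(report))
```

Under this assumption, a crawler that reads the report fetches only the listed addresses instead of downloading the whole site again; unchanged pages such as `/index.html` in the example are skipped.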
Pages: 663-686
Page count: 24