Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliation
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Source
PROCEEDINGS OF THE 2019 2ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND MACHINE INTELLIGENCE (MLMI 2019) | 2019
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling collects pages from the web so that they can be indexed and used to return search results that match users' requirements. Web crawlers must also continually revisit pages to keep the search engine database up to date. Because the crawling process is constrained by time and network limitations, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl previously crawled pages, so that more important and valuable pages are acquired earlier than others. The researchers propose a web crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and each time a change is detected in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages during crawling, so that the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
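The prioritisation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formula, so the `PageRecord` structure, the hash-based change check, the `relevance_threshold` value, and the rule "rank relevant pages by change counter, break ties by relevance" are all assumptions made for illustration. The topic-similarity score is taken as a precomputed number.

```python
import hashlib
import heapq
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical per-page state kept by the crawler."""
    url: str
    relevance: float        # topic similarity in [0, 1], assumed computed elsewhere
    content_hash: str = ""  # fingerprint of the content seen on the last visit
    change_count: int = 0   # the per-page change counter


def fingerprint(content: str) -> str:
    """Fingerprint page content; hashing is an assumed change-detection method."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def record_visit(page: PageRecord, content: str) -> bool:
    """Update a page after a (re)crawl; return True if a change was counted."""
    new_hash = fingerprint(content)
    # The first visit only stores a baseline; it is not counted as a change.
    changed = bool(page.content_hash) and new_hash != page.content_hash
    if changed:
        page.change_count += 1
    page.content_hash = new_hash
    return changed


def recrawl_order(pages, relevance_threshold=0.5):
    """Return URLs of relevant pages, most frequently changed first.

    Pages below the (assumed) relevance threshold are excluded. Ties on the
    change counter are broken by higher relevance, then by URL.
    """
    heap = [
        (-p.change_count, -p.relevance, p.url)
        for p in pages
        if p.relevance >= relevance_threshold
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In this sketch, a crawler would call `record_visit` after each download and `recrawl_order` before each recrawl round. An irrelevant page never enters the queue, however often it changes, which mirrors the abstract's requirement that a page be both relevant and frequently changing to count as important.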
Pages: 40-44
Page count: 5