Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliation
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Source
PROCEEDINGS OF THE 2019 2ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND MACHINE INTELLIGENCE (MLMI 2019) | 2019
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling collects pages from the web so that they can be indexed and used to return search results that match users' requirements. Crawlers must also revisit pages continually to keep the search engine database up to date. Because crawling is constrained by time limits and network issues, it is essential to determine which pages are most important and should be recrawled first. This research introduces a method that tells the crawler in what order to recrawl previously crawled pages, so that the more important and valuable pages are acquired earlier than others. The proposed crawling strategy combines topic similarity with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them repeatedly, and each time a change is detected in a page, that page's counter is incremented. A page that is both relevant and frequently changing is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages in the crawling process: the most dynamic pages, whose content is most likely to have changed, are obtained before the least dynamic ones.
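The recrawl ordering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how changes are detected, how topic similarity is computed, or how ties are broken. The sketch assumes a SHA-256 content fingerprint for change detection, a precomputed `relevance` score in [0, 1], a hypothetical `min_relevance` threshold, and ordering by change count first and relevance second. The names `PageState`, `record_crawl`, and `recrawl_order` are all illustrative.

```python
import hashlib
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageState:
    relevance: float        # assumed topic-similarity score in [0, 1]
    last_hash: str = ""     # fingerprint of the content seen on the previous crawl
    change_count: int = 0   # per-page counter, incremented on each detected change


def fingerprint(content: str) -> str:
    """Hash the page content; an assumed stand-in for change detection."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def record_crawl(state: PageState, content: str) -> bool:
    """Update a page's state after a crawl and report whether it changed."""
    new_hash = fingerprint(content)
    # The first crawl only establishes a baseline; it is not counted as a change.
    changed = state.last_hash != "" and new_hash != state.last_hash
    if changed:
        state.change_count += 1
    state.last_hash = new_hash
    return changed


def recrawl_order(pages: Dict[str, PageState], min_relevance: float = 0.5) -> List[str]:
    """Return relevant URLs, most dynamic first (ties broken by relevance)."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [
        (-state.change_count, -state.relevance, url)
        for url, state in pages.items()
        if state.relevance >= min_relevance
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In this sketch, irrelevant pages are filtered out before ranking, so a page that changes often but is off-topic never enters the recrawl queue. A weighted combination of relevance and change count would be an equally plausible reading of the abstract.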
Pages: 40-44
Number of pages: 5