Using Web Pages Dynamicity to Prioritise Web Crawling

Cited by: 1
Authors
Alderratia, Nisreen [1]
Elsheh, Mohammed [1]
Affiliations
[1] Libyan Acad Misurata, Third Ring Rd, Misurata, Libya
Source
PROCEEDINGS OF THE 2019 2ND INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND MACHINE INTELLIGENCE (MLMI 2019) | 2019
Keywords
Web crawler; importance metric; dynamicity
DOI
10.1145/3366750.3366757
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web crawling collects pages from the web so that they can be indexed and used to return search results that match users' requirements. Web crawlers must also revisit pages continually to keep the search engine database up to date. Because the crawling process is constrained by time and network limitations, it is essential to determine which pages are most important and should be recrawled first. This research therefore introduces a method that tells the crawler in what order to recrawl previously crawled pages, so that the more important and valuable pages are acquired earlier than others. The researchers propose a crawling strategy based on topic similarity combined with the dynamicity of web pages: the crawler downloads relevant pages and recrawls them recursively, and each time a change is detected in a page, that page's counter is incremented. A page that is relevant and changes frequently is therefore considered important and is given high priority in the crawling process. The results indicate that web page dynamicity is an effective basis for prioritising pages during crawling, so that the most dynamic pages, whose content is most likely to change, are obtained before the least dynamic ones.
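The prioritisation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how changes are detected, how topic similarity is computed, or how the two signals are combined. The sketch assumes that a change is detected by comparing content hashes, that relevance is a precomputed score in [0, 1] with a hypothetical threshold of 0.5, and that relevant pages are ordered by change counter first. The names `PageRecord`, `record_crawl`, and `recrawl_order` are illustrative only.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    relevance: float        # topic-similarity score in [0, 1]; assumed to be computed elsewhere
    change_count: int = 0   # number of recrawls on which the content differed
    last_digest: str = ""   # hash of the most recently fetched content


def record_crawl(page: PageRecord, content: str) -> bool:
    """Store the fetched content's hash and increment the counter if it changed."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # The first crawl has nothing to compare against, so it never counts as a change.
    changed = page.last_digest != "" and digest != page.last_digest
    if changed:
        page.change_count += 1
    page.last_digest = digest
    return changed


def recrawl_order(pages, relevance_threshold=0.5):
    """Return relevant pages, most frequently changed first.

    The threshold and the tie-breaking by relevance are assumptions,
    not details taken from the paper.
    """
    relevant = [p for p in pages if p.relevance >= relevance_threshold]
    return sorted(relevant, key=lambda p: (-p.change_count, -p.relevance, p.url))
```

Under these assumptions, a relevant page whose content differed on two recrawls is scheduled ahead of a relevant page that changed once, and a page below the relevance threshold is not scheduled at all, however often it changes.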
Pages: 40-44
Number of pages: 5