Efficient Multi-threaded Crawling Using In Memory Data Structures

Authors
Abdeen, Mohammad A. R. [1]
Affiliations
[1] Islamic Univ Madinah, Fac Comp & Informat Syst, Madinah, Saudi Arabia
Keywords
Web Crawlers; Distributed Applications; Multi-threading; In-memory Data Structures; Performance Evaluation
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Crawling the internet is an essential task for any search engine. A crawler is a software program that sends HTTP requests to web servers across the internet and downloads their contents. Because the internet has grown explosively over the last decade, designing efficient parallel crawlers has become a necessity. One factor that degrades crawler performance is the disk access incurred every time a file is written. Since crawling the web requires downloading tens or hundreds of millions of web pages, much time is consumed by disk writes because of seek times. This work presents an efficient multi-threaded crawler that incorporates an in-memory data structure to reduce the overall disk write time. The results show that, for selected sizes of the in-memory data structure, the proposed technique can increase throughput by about 50% over a conventional multi-threaded crawler with no in-memory data structure. In addition, the results show that this design can achieve an average crawl speed of 22 pages/sec, which exceeds previously reported results.
Pages: 88-92
Number of pages: 5
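
The abstract does not say which in-memory data structure the paper uses or how it is flushed, so the following is only a minimal sketch of the general idea: worker threads put downloaded pages into a shared in-memory buffer, and the buffer writes them to disk in batches rather than one file per page. The names `PageBuffer`, `crawl`, `fake_fetch`, `demo`, and the `capacity` parameter are hypothetical, and `fake_fetch` stands in for a real HTTP download so the example runs without network access.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class PageBuffer:
    """Thread-safe in-memory buffer that writes pages to disk in batches.

    Illustrative only; not the data structure from the paper.
    """

    def __init__(self, out_dir, capacity):
        self.out_dir = out_dir
        self.capacity = capacity  # pages held in memory before a batch write
        self.flush_count = 0
        self._pages = []
        self._lock = threading.Lock()

    def add(self, url, body):
        batch = None
        with self._lock:
            self._pages.append((url, body))
            if len(self._pages) >= self.capacity:
                batch = self._take_locked()
        if batch is not None:
            self._write(*batch)  # disk I/O happens outside the lock

    def flush(self):
        """Write whatever is still buffered (call once crawling ends)."""
        batch = None
        with self._lock:
            if self._pages:
                batch = self._take_locked()
        if batch is not None:
            self._write(*batch)

    def _take_locked(self):
        # Caller must hold the lock: detach the full list and number the batch.
        pages, self._pages = self._pages, []
        number = self.flush_count
        self.flush_count += 1
        return number, pages

    def _write(self, number, pages):
        path = os.path.join(self.out_dir, f"batch_{number:05d}.tsv")
        with open(path, "w", encoding="utf-8") as f:
            for url, body in pages:
                f.write(f"{url}\t{body}\n")


def crawl(urls, fetch, buffer, num_threads=4):
    """Fetch every URL on a thread pool and store results through the buffer."""
    def worker(url):
        buffer.add(url, fetch(url))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, urls))  # consume the iterator to surface errors
    buffer.flush()


def fake_fetch(url):
    # Stand-in for an HTTP GET; a real crawler would download the page here.
    return f"<html>content of {url}</html>"


def demo():
    """Crawl 10 fake URLs with capacity 4 and return the number of files written."""
    with tempfile.TemporaryDirectory() as out_dir:
        buf = PageBuffer(out_dir, capacity=4)
        urls = [f"http://example.test/page{i}" for i in range(10)]
        crawl(urls, fake_fetch, buf, num_threads=3)
        return len(os.listdir(out_dir))
```

With 10 pages and a capacity of 4, `demo()` produces three files instead of ten: two full batches and one final partial batch. Raising `capacity` trades memory for fewer disk writes, which is the kind of tuning the abstract refers to when it mentions selected sizes of the in-memory data structure.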