Design of a vertical search engine for synchrotron data: a big data approach using Hadoop ecosystem

Cited by: 0
Authors
Ali Khaleghi
Kamran Mahmoudi
Sonia Mozaffari
Affiliation
[1] Imam Khomeini International University
Source
SN Applied Sciences | 2019 / Vol. 1
Keywords
Synchrotron; Search Engine; Information retrieval; Big data; Hadoop; Solr; Nutch
DOI
Not available
Abstract
As an experimental physics facility, a synchrotron enables multidisciplinary research and collaboration among scientists in fields such as physics and chemistry. During the construction and operation of such a facility, valuable data about its design, its instruments, and the experiments conducted there are published and stored. Researchers spend a long time sifting through results from general-purpose search engines to find the scientific information they need, so a domain-specific search engine can help them find it with greater precision. The crawled data can also be used to build a knowledge base and to generate the datasets researchers require. Several vertical search engines have already been designed for scientific data, for example medical information. In this paper we propose the design of such a search engine on top of the Apache Hadoop framework. The Hadoop ecosystem provides necessary features such as scalability, fault tolerance, and availability. It also abstracts away the complexity of search engine design by using open-source tools as building blocks, among them Apache Nutch for crawling and Apache Solr for indexing and query processing. In our initial results, obtained by implementing the proposed method in single-node mode, an index of over a hundred thousand pages was created with an average fetch interval of 30 days, comprising 28 segments and approximately 570 MB in size. Performance factors such as usage of the available bandwidth and system load were logged using the Linux sysstat package.