Scalable entity-based summarization of web search results using MapReduce

Cited by: 0
Authors
Ioannis Kitsos
Kostas Magoutis
Yannis Tzitzikas
Affiliations
[1] FORTH-ICS, Institute of Computer Science
[2] University of Crete, Computer Science Department
Source
Distributed and Parallel Databases | 2014 / Vol. 32
Keywords
Text data analytics through summaries and synopses; Interactive data analysis through queryable summaries and indices; Information retrieval and named entity mining; MapReduce; Cloud computing;
DOI
Not available
Abstract
Although Web search engines index and provide access to huge numbers of documents, user queries typically return only a linear list of hits. While this is often satisfactory for focalized search, it does not support exploration or deeper analysis of the results. One way to achieve advanced exploration facilities, exploiting the availability of structured (and semantic) data in Web search, is to enrich it with entity mining over the full contents of the search results. Such services provide users with an initial overview of the information space, allowing them to gradually restrict it until they locate the desired hits, even if these are low ranked. This is especially important in areas of professional search such as medical search and patent search. In this paper we consider a general scenario of providing such services as meta-services (that is, layered over systems that support keyword search) without a priori indexing of the underlying document collection(s). To make such services feasible for large amounts of data, we use the MapReduce distributed computation model on a Cloud infrastructure (Amazon EC2). Specifically, we show how the required computational tasks can be factorized and expressed as MapReduce functions. A key contribution of our work is a thorough evaluation of platform configuration and tuning, an aspect that is often disregarded or inadequately addressed in prior work but is crucial for the efficient utilization of resources. Finally, we report experimental results on the speedup achieved in various settings.
Pages: 405-446
Number of pages: 41