MRShare: Sharing Across Multiple Queries in MapReduce

Cited by: 104
Authors
Nykiel, Tomasz [1]
Potamias, Michalis [2]
Mishra, Chaitanya [3]
Kollios, George [2]
Koudas, Nick [1]
Affiliations
[1] Univ Toronto, Toronto, ON, Canada
[2] Boston Univ, Boston, MA USA
[3] Facebook, Menlo Pk, CA USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2010 / Vol. 3 / No. 1
Funding
U.S. National Science Foundation
DOI
10.14778/1920841.1920906
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) that can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges incurred while using the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and provide a solution that derives the optimal grouping of queries. Experiments on our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.
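To make the idea of grouping jobs under a cost model concrete, the following is a minimal sketch and not the cost model or algorithm from the paper. It assumes a toy cost in which each merged group pays one shared input scan plus an intermediate-data cost that grows with the number of jobs merged. The names `group_cost`, `best_grouping`, `scan_cost`, and `penalty` are hypothetical, and the search is restricted to contiguous groups of jobs in the order given.

```python
def group_cost(outputs, scan_cost, penalty):
    """Toy cost of one merged group (an assumption, not the paper's model):
    one shared input scan plus intermediate-data handling that grows
    linearly with the number of jobs merged into the group."""
    k = len(outputs)
    return scan_cost + sum(outputs) * (1.0 + penalty * (k - 1))


def best_grouping(outputs, scan_cost, penalty):
    """Split jobs, in the given order, into contiguous groups with minimal
    total toy cost, using dynamic programming over prefix lengths."""
    n = len(outputs)
    best = [0.0] + [float("inf")] * n  # best[i]: min cost for the first i jobs
    cut = [0] * (n + 1)                # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i):
            c = best[j] + group_cost(outputs[j:i], scan_cost, penalty)
            if c < best[i]:
                best[i], cut[i] = c, j
    # Walk the cut points backwards to recover the groups.
    groups, i = [], n
    while i > 0:
        groups.append(outputs[cut[i]:i])
        i = cut[i]
    groups.reverse()
    return best[n], groups


if __name__ == "__main__":
    # Expensive scan, no merge penalty: sharing one scan is cheapest.
    print(best_grouping([1, 1, 1], scan_cost=10, penalty=0.0))
    # Free scan, high merge penalty: running jobs separately is cheapest.
    print(best_grouping([1, 1, 1], scan_cost=0, penalty=1.0))
```

In this toy setting, a high scan cost pushes all jobs into one group, while a high merge penalty keeps them apart, which mirrors the trade-off that makes the grouping an optimization problem rather than a fixed rule.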
Pages: 494-505
Page count: 12