Sharing across Multiple MapReduce Jobs

Cited by: 7

Authors:

Nykiel, Tomasz [1]

Potamias, Michalis [2]

Mishra, Chaitanya [3]

Kollios, George [4]

Koudas, Nick [1]

Affiliations:

[1] Univ Toronto, Toronto, ON M5S 1A1, Canada

[2] Groupon, Chicago, IL USA

[3] Facebook Inc, Menlo Pk, CA USA

[4] Boston Univ, Boston, MA 02215 USA

Source:

ACM Transactions on Database Systems | 2014, Vol. 39, No. 2

Keywords:

Algorithms; Sharing MapReduce jobs; systems; MapReduce; query processing; EFFICIENT; QUERIES; SYSTEM;

DOI:

10.1145/2560796

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure can be directly associated with monetary cost. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges for using the processing infrastructure. In this article we present MRShare, a sharing framework tailored to MapReduce. MRShare transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Given the query grouping, we merge jobs appropriately and submit them to MapReduce for processing. A key property of MRShare is that it is independent of the MapReduce implementation. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach. MRShare is primarily designed for handling I/O-intensive queries. However, with the development of high-level languages operating on top of MapReduce, user queries executed in this model become more complex and CPU intensive. Commonly executed queries can be modeled as evaluating pipelines of CPU-expensive filters over the input stream. Examples of such filters include, but are not limited to, index probes and certain types of joins.
In this article we adapt some of the standard techniques for filter ordering used in relational and stream databases, propose their extensions, and implement them through MRAdaptiveFilter, an extension of MRShare for expensive filter ordering tailored to MapReduce, which handles both single- and batch-query execution modes. We present an experimental evaluation that demonstrates additional benefits of MRAdaptiveFilter when executing CPU-intensive queries in MRShare.

Pages: 46
