Sharing across Multiple MapReduce Jobs

Cited by: 7
Authors
Nykiel, Tomasz [1]
Potamias, Michalis [2]
Mishra, Chaitanya [3]
Kollios, George [4]
Koudas, Nick [1]
Affiliations
[1] Univ Toronto, Toronto, ON M5S 1A1, Canada
[2] Groupon, Chicago, IL USA
[3] Facebook Inc, Menlo Pk, CA USA
[4] Boston Univ, Boston, MA 02215 USA
Source
ACM Transactions on Database Systems | 2014 / Vol. 39 / Issue 2
Keywords
Algorithms; Sharing MapReduce jobs; systems; MapReduce; query processing; EFFICIENT; QUERIES; SYSTEM;
DOI
10.1145/2560796
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure can be directly associated with monetary cost. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges for using the processing infrastructure. In this article we present a sharing framework tailored to MapReduce, namely, MRShare. MRShare transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Given the query grouping, we merge jobs appropriately and submit them to MapReduce for processing. A key property of MRShare is that it is independent of the MapReduce implementation. Experiments with our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach. MRShare is primarily designed for handling I/O-intensive queries. However, with the development of high-level languages operating on top of MapReduce, user queries executed in this model become more complex and CPU intensive. Commonly executed queries can be modeled as evaluating pipelines of CPU-expensive filters over the input stream. Examples of such filters include, but are not limited to, index probes or certain types of joins.
In this article we adapt some of the standard techniques for filter ordering used in relational and stream databases, propose their extensions, and implement them through MRAdaptiveFilter, an extension of MRShare for expensive filter ordering tailored to MapReduce, which handles both single- and batch-query execution modes. We present an experimental evaluation that demonstrates the additional benefits of MRAdaptiveFilter when executing CPU-intensive queries in MRShare.
Pages: 46
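Illustrative note on the abstract: the "standard techniques for filter ordering used in relational and stream databases" include the classic rank-based ordering, in which independent filters are applied in ascending order of cost / (1 - selectivity). The Python sketch below shows only that textbook rule. It is not the MRAdaptiveFilter algorithm, whose details the abstract does not give. The `Filter` class, its fields, and the sample costs and selectivities are hypothetical, and the filters are assumed to be statistically independent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Filter:
    """Hypothetical description of one expensive filter in a pipeline."""
    name: str
    cost: float         # assumed average CPU cost of evaluating one record
    selectivity: float  # assumed fraction of records that pass, in [0, 1]
    predicate: Callable[[dict], bool]


def rank(f: Filter) -> float:
    """Classic rank metric: cost per unit of records eliminated."""
    if f.selectivity >= 1.0:
        return float("inf")  # a filter that drops nothing goes last
    return f.cost / (1.0 - f.selectivity)


def order_filters(filters: List[Filter]) -> List[Filter]:
    """Return the filters sorted by ascending rank (independence assumed)."""
    return sorted(filters, key=rank)


def expected_cost(filters: List[Filter]) -> float:
    """Expected per-record cost of applying the filters in the given order."""
    total, surviving = 0.0, 1.0
    for f in filters:
        total += surviving * f.cost  # only surviving records pay this cost
        surviving *= f.selectivity
    return total


def passes(record: dict, filters: List[Filter]) -> bool:
    """Apply the pipeline to one record, stopping at the first failure."""
    return all(f.predicate(record) for f in filters)


if __name__ == "__main__":
    pipeline = [
        Filter("index_probe", 10.0, 0.9, lambda r: r["id"] % 10 != 0),
        Filter("cheap_check", 1.0, 0.5, lambda r: r["id"] % 2 == 0),
        Filter("join_lookup", 5.0, 0.1, lambda r: r["id"] % 10 == 4),
    ]
    ordered = order_filters(pipeline)
    print([f.name for f in ordered])
    print(expected_cost(pipeline), expected_cost(ordered))
```

Under these assumed numbers the cheap, selective filter moves to the front and the costly, unselective one moves to the back, which lowers the expected per-record cost. An adaptive scheme would re-estimate cost and selectivity at run time rather than fixing them up front.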