Dynamic Configuration of Partitioning in Spark Applications

Cited by: 53

Authors:

Gounaris, Anastasios ^{[1]}

Kougka, Georgia ^{[2]}

Tous, Ruben ^{[3]}

Montes, Carlos Tripiana ^{[4]}

Torres, Jordi ^{[4]}

Affiliations:

[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece

[2] Aristotle Univ Thessaloniki, Thessaloniki, Greece

[3] Univ Politecn Cataluna, ES-08034 Barcelona, Spain

[4] Barcelona Supercomp Ctr, Barcelona 08034, Spain

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2017 / Vol. 28 / No. 07

Keywords:

Data repartitioning; data flow optimization; data flow profiling; Spark; optimization

DOI:

10.1109/TPDS.2017.2647939

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Spark has become one of the main options for large-scale analytics running on top of shared-nothing clusters. This work takes a deep dive into parallelism configuration and sheds light on the behavior of parallel Spark jobs. It is motivated by the fact that running a Spark application on all the available processors does not necessarily imply lower running time, while it may entail a waste of resources. We first propose analytical models for expressing the running time as a function of the number of machines employed. We then take another step, namely to present novel algorithms for configuring dynamic partitioning with a view to minimizing resource consumption without sacrificing running time beyond a user-defined limit. The problem we target is NP-hard. To tackle it, we propose a greedy approach after introducing the notions of dependency graphs and of the benefit from modifying the degree of partitioning at a stage; complementarily, we investigate a randomized approach. Our polynomial solutions are capable of judiciously using the resources that are potentially at the user's disposal and strike interesting trade-offs between running time and resource consumption. Their efficiency is thoroughly investigated through experiments based on real execution data.

Citation

Pages: 1891-1904

Page count: 14

