Using machine learning to optimize parallelism in big data applications

Cited by: 48
Authors
Brandon Hernandez, Alvaro [1]
Perez, Maria S. [1]
Gupta, Smrati [2]
Muntes-Mulero, Victor [2]
Affiliations
[1] Univ Politecn Madrid, Ontol Engn Grp, Calle Ciruelos, E-28660 Madrid, Spain
[2] CA Technol, Pl Pau,WTC Almeda Pk Edif 2 Planta 4, Barcelona 08940, Spain
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2018 / Vol. 86
Funding
European Union Horizon 2020;
Keywords
Machine learning; Spark; Parallelism; Big data;
DOI
10.1016/j.future.2017.07.003
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
In-memory cluster computing platforms have gained momentum in recent years because of their ability to analyse large amounts of data in parallel. These platforms are complex and difficult to manage. In addition, there is a lack of tools for understanding and optimizing such platforms, which form the backbone of big data infrastructure and technologies. This leads directly to underutilization of available resources and to application failures in such environments. One key way to address this problem is to optimize the task parallelism of applications in such environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at the system and application levels, we are able to find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before they launch their workloads in the cluster, avoiding possible failures, performance degradation and wasted resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid5000 testbed. We observe a performance gain of up to 51% when using the recommended parallelism settings. The model is also interpretable and can give the user insight into how different metrics and parameters affect performance. (C) 2017 Elsevier B.V. All rights reserved.
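The recommend-before-launch idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which model, metrics, or parameters the authors use, so everything below is an assumption made for illustration: the metric names (`cpu_load`, `input_gb`), the invented run history, the candidate settings, and the 1-nearest-neighbour lookup that stands in for a trained model.

```python
from math import dist

# Hypothetical history of past runs: (cpu_load, input_gb, parallelism) -> seconds.
# All numbers are invented for illustration; they are not results from the paper.
HISTORY = [
    ((0.9, 10.0, 8), 120.0),
    ((0.9, 10.0, 32), 70.0),
    ((0.9, 10.0, 128), 95.0),
    ((0.3, 1.0, 8), 20.0),
    ((0.3, 1.0, 32), 26.0),
    ((0.3, 1.0, 128), 40.0),
]


def predict_duration(cpu_load, input_gb, parallelism):
    """Predict run time with a 1-nearest-neighbour lookup (a stand-in model)."""
    query = (cpu_load, input_gb, parallelism)
    _, seconds = min(HISTORY, key=lambda run: dist(run[0], query))
    return seconds


def recommend_parallelism(cpu_load, input_gb, candidates=(8, 32, 128)):
    """Return the candidate setting with the lowest predicted run time."""
    return min(candidates, key=lambda p: predict_duration(cpu_load, input_gb, p))
```

With this invented history, `recommend_parallelism(0.9, 10.0)` picks 32 and `recommend_parallelism(0.3, 1.0)` picks 8. In a Spark deployment such a value might be supplied through a setting like `spark.default.parallelism`, although the abstract does not state which parameters the method tunes.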
Pages: 1076-1092
Number of pages: 17