Efficient Performance Prediction for Apache Spark

Cited by: 26
Authors
Cheng, Guoli [1]
Ying, Shi [1]
Wang, Bingming [1]
Li, Yuhang [1]
Affiliations
[1] Wuhan University, School of Computer Science, Bayi Road 299, Wuhan, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
Performance prediction; Spark; System configuration; Adaboost; Projective sampling;
DOI
10.1016/j.jpdc.2020.10.010
Chinese Library Classification number
TP301 [Theory and Methods];
Discipline classification code
081202;
Abstract
Spark is a distributed big data processing framework that succeeded Hadoop and is more efficient. It exposes more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs efficiently is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that can efficiently and accurately predict the performance of a given application with a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach. © 2020 Elsevier Inc. All rights reserved.
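The projective sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the learning curve follows a power law, error(n) = a * n^(-b), fits that curve to a few (sample size, prediction error) observations, and projects the smallest sample size expected to meet a target error. The function names `fit_power_law` and `projected_sample_size` and the numbers in the usage lines are hypothetical and purely illustrative.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**(-b) in log-log space.

    Assumption: a power-law learning curve; other curve families
    (logarithmic, exponential) could be fitted the same way.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = log(a) - b * log(n), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def projected_sample_size(a, b, target_error):
    """Smallest n for which the fitted curve predicts error <= target_error."""
    return math.ceil((a / target_error) ** (1.0 / b))


# Illustrative, synthetic observations: prediction error measured after
# training on 4, 16, and 64 configurations (not data from the paper).
sizes = [4, 16, 64]
errors = [0.25, 0.125, 0.0625]
a, b = fit_power_law(sizes, errors)
needed = projected_sample_size(a, b, target_error=0.05)
```

In practice the observations would come from training a model on progressively larger sets of sampled Spark configurations and measuring its error on held-out runs; sampling stops once the projected size is reached.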
Pages: 40-51
Page count: 12