Benchmarking and Performance Modelling of Dataflow with Cycles

Cited by: 0
Authors
Ceesay, Sheriffo [1]
Lin, Yuhui [1]
Barker, Adam [1]
Affiliation
[1] Univ St Andrews, St Andrews, Fife, Scotland
Source
8TH IEEE/ACM INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING, APPLICATIONS AND TECHNOLOGIES, BDCAT 2021 | 2021
Keywords
Dataflow With Cycles; Communication Patterns; Modelling; Machine Learning; Big Data;
DOI
10.1145/3492324.3494159
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Over the years, the popularity of iterative data-intensive applications, such as machine learning applications, has grown immensely. Unlike batch applications, iterative applications such as k-means, regression, or classification algorithms require multiple passes over the input data to train a model until it converges. In the context of big data, these applications are executed on distributed computing frameworks such as Apache Spark. These frameworks are simple to deploy and use; under the hood, however, they are complex and highly configurable. An exhaustive study of the impact of these configuration parameters on application performance would be cumbersome because the number of parameter combinations grows exponentially. In this paper, we group applications based on a common dataflow and communication pattern. We then present a multi-objective performance prediction framework to model the performance of these applications. The models can predict the execution time of a given application with high accuracy. The framework can be used to infer optimal configuration parameters to meet application execution deadlines. Based on these optimal configuration values, we recommend the best EC2 instances in terms of cost. On average, the predicted values are within ±14% of the measured values.
Pages: 91-100
Page count: 10