Benchmarking and Performance Modelling of Dataflow with Cycles

Cited by: 0
Authors
Ceesay, Sheriffo [1]
Lin, Yuhui [1]
Barker, Adam [1]
Affiliations
[1] Univ St Andrews, St Andrews, Fife, Scotland
Source
8TH IEEE/ACM INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING, APPLICATIONS AND TECHNOLOGIES, BDCAT 2021 | 2021
Keywords
Dataflow With Cycles; Communication Patterns; Modelling; Machine Learning; Big Data;
DOI
10.1145/3492324.3494159
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Over the years, the popularity of iterative data-intensive applications such as machine learning applications has grown immensely. Unlike batch applications, iterative applications such as k-means, regression or classification algorithms require multiple passes over the input data to train a model until it converges. In the context of big data, these applications are executed on distributed computing frameworks such as Apache Spark. These frameworks are simple to deploy and use; under the hood, however, they are complex and highly configurable. An exhaustive study of the impact of these configuration parameters on application performance would be cumbersome because the number of parameter combinations grows exponentially. In this paper, we group applications based on a common dataflow and communication pattern. We then present a multi-objective performance prediction framework to model the performance of these applications. The models can predict the execution time of a given application with high accuracy. The framework can be used to infer optimal configuration parameters to meet application execution deadlines. Based on these optimal configuration values, we recommend the best EC2 instances in terms of cost. On average, the predictions deviate from the measured values by ±14%.
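The deadline-and-cost selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' framework: the class Candidate, the function cheapest_within_deadline, the instance labels, the prices and the predicted times are all hypothetical, per-second billing is assumed, and using the reported ±14% average error as a safety margin is an assumption made here for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One cluster configuration with a predicted run time (all values hypothetical)."""
    instance_type: str        # illustrative label, not a real EC2 price-list entry
    num_nodes: int
    predicted_seconds: float  # would come from a trained performance model
    hourly_price_usd: float   # price per node per hour, illustrative

    @property
    def cost_usd(self) -> float:
        # Assumes per-second billing: nodes * hourly price * hours used.
        return self.num_nodes * self.hourly_price_usd * self.predicted_seconds / 3600.0


def cheapest_within_deadline(
    candidates: Iterable[Candidate],
    deadline_seconds: float,
    error_margin: float = 0.14,  # assumption: pad predictions by the reported average error
) -> Optional[Candidate]:
    """Return the cheapest candidate whose padded prediction meets the deadline."""
    feasible = [
        c for c in candidates
        if c.predicted_seconds * (1.0 + error_margin) <= deadline_seconds
    ]
    # min(..., default=None) returns None when no configuration is feasible.
    return min(feasible, key=lambda c: c.cost_usd, default=None)


# Hypothetical candidates, purely for illustration.
CANDIDATES = [
    Candidate("small", num_nodes=4, predicted_seconds=1800.0, hourly_price_usd=0.10),
    Candidate("medium", num_nodes=4, predicted_seconds=1000.0, hourly_price_usd=0.20),
    Candidate("large", num_nodes=8, predicted_seconds=500.0, hourly_price_usd=0.40),
]

if __name__ == "__main__":
    choice = cheapest_within_deadline(CANDIDATES, deadline_seconds=1200.0)
    print(choice.instance_type if choice else "no feasible configuration")
```

Padding the prediction before comparing it with the deadline is one conservative design choice; a tighter or looser margin changes which configurations count as feasible.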
Pages: 91-100
Number of pages: 10