A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Cited by: 12
Authors
Al-Sayeh, Hani [1]
Hagedorn, Stefan [1]
Sattler, Kai-Uwe [1]
Affiliations
[1] Tech Univ Ilmenau, Ilmenau, Thuringen, Germany
Keywords
Big data; Runtime prediction; Modeling; Queries
DOI
10.1007/s10619-020-07286-y
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Apache Spark jobs often process huge data sets and therefore run for minutes to hours. Being able to predict the runtime of such jobs is thus useful not only for knowing when a job will finish, but also for scheduling, for estimating the monetary cost of a cloud deployment, and for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than predicting the runtimes of standard database queries: cluster configuration and parameters have a significant impact on performance, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. First, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the job's behavior and application parameters, such as the filters applied to the data and the number of iterations. Second, a black-box model for each task is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only to predict the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. We validate the methodology experimentally; the evaluation shows highly accurate predictions of the actual job runtime and a performance improvement when intermediate results can be reused.
Pages: 819-839
Page count: 21
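
The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of per-filter selectivities, the single feature rows/cores, and the linear runtime model are all assumptions chosen only to show how a white-box cardinality estimate might feed a black-box model fitted from monitored runs.

```python
from dataclasses import dataclass


def predict_cardinality(input_rows, filter_selectivities):
    """White-box step (assumed form): propagate the input cardinality
    through a chain of filters with known selectivities in [0, 1]."""
    rows = float(input_rows)
    for selectivity in filter_selectivities:
        rows *= selectivity
    return rows


@dataclass
class TaskModel:
    """Black-box model (assumed linear): runtime = intercept + slope * rows / cores."""
    intercept: float
    slope: float

    def predict(self, rows, cores):
        return self.intercept + self.slope * rows / cores


def fit_task_model(samples):
    """Black-box step: ordinary least squares over monitored runs.

    samples is a list of (rows, cores, runtime_seconds) tuples collected
    while varying allocated resources and input cardinalities.
    """
    xs = [rows / cores for rows, cores, _ in samples]
    ys = [runtime for _, _, runtime in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return TaskModel(intercept=mean_y - slope * mean_x, slope=slope)


def predict_job_runtime(input_rows, filter_selectivities, iterations, cores, model):
    """Gray-box prediction: white-box cardinality feeds the black-box model,
    repeated once per iteration (assumes identical iterations)."""
    rows = predict_cardinality(input_rows, filter_selectivities)
    return iterations * model.predict(rows, cores)
```

A caller would first fit the task model from a few monitored runs with `fit_task_model`, then pass it to `predict_job_runtime` together with the application parameters. A real model would likely need more features and a per-task breakdown than this single-feature sketch provides.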