Inferring Workflows with Job Dependencies from Distributed Processing Systems Logs (Or, how to evaluate your systems with realistic workflows NOT pulled out of thin air)

Citations: 0
Authors
Carrillo, Gladys E. [1 ]
Abad, Cristina L. [1 ]
Affiliations
[1] Escuela Superior Politécnica del Litoral (ESPOL), Campus Gustavo Galindo, Km 30.5 Vía Perimetral, Guayaquil, Ecuador
Source
2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI | 2017
Keywords
Distributed processing; clusters; data mining; Hadoop; workflows; workloads; framework
DOI
10.1109/DASC-PICom-DataCom-CyberSciTec.2017.168
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We consider the problem of evaluating new improvements to distributed processing platforms like Spark and Hadoop. One approach commonly used when evaluating these systems is to use workloads published by companies with large data clusters, like Google and Facebook. These evaluations seek to demonstrate the benefits of improvements to critical framework components like the job scheduler, under realistic workloads. However, published workloads typically do not contain information on dependencies between the jobs. This is problematic, as ignoring dependencies could lead to significantly misestimating the speedup obtained from a particular improvement. In this position paper, we discuss why it is important to include job dependency information when evaluating distributed processing frameworks, and show that workflow mining techniques can be used to obtain dependencies from job traces that lack them. As a proof-of-concept, we show that the proposed methodology is able to find workflows in traces published by Google.
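To make the idea of recovering dependencies from a trace that lacks them more concrete, the following is a minimal sketch of a simple timing heuristic: a job is treated as a candidate child of another when it repeatedly starts shortly after the other finishes. This is an illustration only and is not the methodology of the paper. The record layout (`JobRun`), the function name `infer_candidate_dependencies`, the thresholds `max_gap` and `min_support`, and the sample trace are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class JobRun(NamedTuple):
    name: str     # logical job name (hypothetical trace field)
    start: float  # start time in seconds
    end: float    # end time in seconds


def infer_candidate_dependencies(runs, max_gap=60.0, min_support=2):
    """Return sorted (parent, child) name pairs where a run of child
    started within max_gap seconds after a run of parent ended,
    at least min_support times in the trace."""
    support = Counter()
    for parent in runs:
        for child in runs:
            if parent.name == child.name:
                continue
            gap = child.start - parent.end
            if 0.0 <= gap <= max_gap:
                support[(parent.name, child.name)] += 1
    return sorted(pair for pair, n in support.items() if n >= min_support)


# Made-up trace: "transform" follows "extract" twice; "extract" follows
# "report" only once, so that pair is filtered out as coincidental.
trace = [
    JobRun("extract", 0, 100), JobRun("transform", 105, 200),
    JobRun("report", 900, 950),
    JobRun("extract", 1000, 1100), JobRun("transform", 1110, 1190),
]
candidate_edges = infer_candidate_dependencies(trace)
print(candidate_edges)  # [('extract', 'transform')]
```

The support threshold is what separates a recurring pattern from a one-off coincidence in timing; a heuristic like this only yields candidate edges, which would still need validation against known workflows.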
Pages: 1025-1030
Page count: 6