Improving Backfilling by using Machine Learning to Predict Running Times

Cited by: 62
Authors
Gaussier, Eric [1]
Glesser, David [2]
Reis, Valentin [1]
Trystram, Denis [1]
Affiliations
[1] Univ Grenoble Alpes, LIG, Grenoble, France
[2] BULL HPC Div, Grenoble, France
Source
PROCEEDINGS OF SC15: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2015
Keywords
High Performance Computing; Running Time Estimation; Scheduling; Machine Learning
DOI
10.1145/2807591.2807646
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever increasing amount of data, they are characterized by uncertainties on some parameters like the job running times. The question raised in this work is: To what extent is it possible/useful to take into account predictions on the job running times for improving the global scheduling? We present a comprehensive study for answering this question assuming the popular EASY backfilling policy. More precisely, we rely on some classical methods in machine learning and propose new cost functions well-adapted to the problem. Then, we assess our proposed solutions through intensive simulations using several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% considering the average bounded slowdown objective.
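The abstract reports its result in terms of the average bounded slowdown but does not define it. As a minimal sketch, assuming the definition commonly used in the job scheduling literature, a job's bounded slowdown is max((wait + run) / max(run, tau), 1), where tau is a small threshold that keeps very short jobs from dominating the average. The function names and the default tau = 10 seconds below are illustrative assumptions, not values taken from this record.

```python
from typing import Iterable, Tuple


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Bounded slowdown of one job (times in seconds).

    Assumed definition: max((wait + run) / max(run, tau), 1).
    tau = 10 s is an assumed default, not a value from the abstract.
    """
    return max((wait + run) / max(run, tau), 1.0)


def average_bounded_slowdown(
    jobs: Iterable[Tuple[float, float]], tau: float = 10.0
) -> float:
    """Mean bounded slowdown over (wait, run) pairs."""
    values = [bounded_slowdown(wait, run, tau) for wait, run in jobs]
    if not values:
        raise ValueError("at least one job is required")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical (wait, run) pairs in seconds, for illustration only.
    sample = [(0.0, 5.0), (90.0, 10.0), (100.0, 100.0)]
    print(average_bounded_slowdown(sample))
```

Under this definition a job that never waits has a bounded slowdown of 1, so a lower average means jobs wait less relative to their running time.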
Pages: 10