Resource Utilization Aware Job Scheduling to Mitigate Performance Variability

Cited by: 6
Authors
Nichols, Daniel [1 ]
Marathe, Aniruddha [2 ]
Shoga, Kathleen [2 ]
Gamblin, Todd [2 ]
Bhatele, Abhinav [1 ]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA 94551 USA
Source
2022 IEEE 36TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2022) | 2022
Funding
U.S. National Science Foundation;
Keywords
performance variability; data analytics; machine learning; prediction models; scheduling;
DOI
10.1109/IPDPS53621.2022.00040
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline Code
0812;
Abstract
Resource contention on high performance computing (HPC) platforms can lead to significant variation in application performance. When many jobs experience such large variations in run time, system resources are used less efficiently. It can also lead users to overestimate their jobs' expected run times, which degrades the efficiency of the system scheduler. Mitigating performance variation on HPC platforms therefore benefits end users and enables more efficient use of system resources. In this paper, we present a pipeline for collecting and analyzing system and application performance data for jobs submitted over long periods of time. We use a set of machine learning (ML) models trained on this data to classify performance variation using current system counters. Additionally, we present a new resource-aware job scheduling algorithm that utilizes the ML pipeline and the current system state to mitigate job variation. We evaluate our pipeline, ML models, and scheduler using various proxy applications and an actual implementation of the scheduler on an InfiniBand-based fat-tree cluster.
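To make the abstract's approach concrete, the following is a minimal sketch, not the authors' implementation: a classifier trained on system counters predicts whether a job submitted under the current system state would see high run-time variability, and a simple scheduling rule defers jobs predicted to vary. The counter features, the RandomForestClassifier choice, the synthetic labels, and the launch/defer policy are all illustrative assumptions; the paper's actual models, counters, and scheduling algorithm may differ.

# Hedged sketch: ML-based variability classification driving a
# resource-aware scheduling decision. All names and data are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic training data: each row is a snapshot of system counters
# (e.g., network congestion, filesystem load); label 1 means the job
# run under those conditions showed high run-time variability.
X_train = rng.random((500, 4))                     # 4 hypothetical counters
y_train = (X_train[:, 0] + X_train[:, 1] > 1.2).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

def schedule(job_queue, read_counters):
    """Toy resource-aware policy: launch jobs whose predicted
    variability class is low; defer the rest to a later pass."""
    launched, deferred = [], []
    for job in job_queue:
        counters = read_counters()                 # current system state
        if clf.predict([counters])[0] == 0:        # 0 = low variability
            launched.append(job)
        else:
            deferred.append(job)
    return launched, deferred

launched, deferred = schedule(
    ["job-a", "job-b", "job-c"],
    read_counters=lambda: rng.random(4),
)
print("launch now:", launched, "| defer:", deferred)

In this toy policy, deferred jobs would simply be reconsidered on the next scheduling pass once the counters improve; the paper's scheduler integrates such predictions into the system scheduler itself rather than as a standalone filter.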
Pages: 335-345
Page count: 11