GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems

Cited by: 14

Authors:

Li, Jingbo [1]

Zhang, Xingjun [1]

Wei, Jia [1]

Ji, Zeyu [1]

Wei, Zheng [1]

Affiliations:

[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2022 / Vol. 135

Keywords:

Task scheduling; Deep reinforcement learning; Distributed systems; High performance computing; Expert guidance;

DOI:

10.1016/j.future.2022.04.032

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Efficient task scheduling has become increasingly complex as the number and types of tasks proliferate and computing resources grow in large-scale distributed high-performance computing (HPC) systems. Deep reinforcement learning (DRL) methods have achieved some success in scheduling problems. However, due to the exogeneity of tasks and the sparsity of rewards, learning a DRL control policy requires a significant amount of training time and data and cannot guarantee effective convergence. Meanwhile, drawing on their understanding of HPC system characteristics, experts have developed various scheduling policies with acceptable performance for different optimization goals. These heuristic methods, however, cannot adapt to environmental changes or optimize for specific workloads. The generative adversarial reinforcement learning scheduling (GARLSched) algorithm is therefore proposed to guide DRL learning in large-scale dynamic scheduling problems, based on the optimal policy in the expert pool. In addition, the task embedding-based discriminator network improves and stabilizes the learning process. Experiments show that, compared with heuristic and DRL scheduling algorithms, GARLSched can learn high-quality scheduling policies for various workloads and optimization objectives. Furthermore, the learned models perform stably even when applied to unseen workloads, making them more practical in HPC systems. (C) 2022 Elsevier B.V. All rights reserved.

Pages: 259-269

Number of pages: 11
