GARLSched: Generative adversarial deep reinforcement learning task scheduling optimization for large-scale high performance computing systems

Cited by: 14
Authors
Li, Jingbo [1]
Zhang, Xingjun [1]
Wei, Jia [1]
Ji, Zeyu [1]
Wei, Zheng [1]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710049, Peoples R China
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2022 / Volume 135
Keywords
Task scheduling; Deep reinforcement learning; Distributed systems; High performance computing; Expert guidance;
DOI
10.1016/j.future.2022.04.032
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Efficient task scheduling has become increasingly complex as the number and types of tasks proliferate and the scale of computing resources grows in large-scale distributed high-performance computing (HPC) systems. Deep reinforcement learning (DRL) methods have achieved some success in scheduling problems. However, because tasks are exogenous and rewards are sparse, learning a DRL control policy requires a significant amount of training time and data, and effective convergence cannot be guaranteed. Meanwhile, drawing on their understanding of HPC system characteristics, experts have developed various scheduling policies with acceptable performance for different optimization goals. However, these heuristic methods cannot adapt to environmental changes or optimize for specific workloads. Therefore, the generative adversarial reinforcement learning scheduling (GARLSched) algorithm is proposed to effectively guide the learning of DRL in large-scale dynamic scheduling problems based on the optimal policy in the expert pool. In addition, the task embedding-based discriminator network effectively improves and stabilizes the learning process. Experiments show that, compared with heuristic and DRL scheduling algorithms, GARLSched can learn high-quality scheduling policies for various workloads and optimization objectives. Furthermore, the learned models perform stably even when applied to unseen workloads, making them more practical in HPC systems. (C) 2022 Elsevier B.V. All rights reserved.
Pages: 259-269
Number of pages: 11