Optimization of High-Performance Computing Job Scheduling Based on Offline Reinforcement Learning

Times cited: 0
Authors
Li, Shihao [1, 2]
Dai, Wei [1, 2]
Chen, Yongyan [1, 2]
Liang, Bo [1, 2]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Peoples R China
[2] Kunming Univ Sci & Technol, Comp Technol Applicat Key Lab Yunnan Prov, Kunming 650500, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 23
Funding
National Natural Science Foundation of China;
Keywords
job scheduling; offline reinforcement learning; high-performance computing;
DOI
10.3390/app142311220
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
In large-scale, distributed high-performance computing systems, the complexity of job scheduling has grown along with computational resources and job diversity. While heuristic scheduling strategies with various optimization objectives have shown promising results, their effectiveness is often limited in real-world applications due to the dynamic nature of workloads and system configurations. Deep reinforcement learning (DRL) methods offer the potential to address these scheduling challenges. However, their trial-and-error learning approach can lead to suboptimal performance or resource wastage in the early stages. To mitigate these risks, this paper introduces an offline reinforcement learning-based job scheduling method. By training on historical data, the method avoids the pitfalls of deploying immature strategies in live environments. We constructed an offline dataset by combining expert scheduling trajectories with early-stage trial data from online reinforcement learning. This enables the development of more robust scheduling policies. Experimental results demonstrate that, compared to heuristic and online DRL algorithms, the proposed approach achieves more efficient scheduling performance across various workloads and optimization goals, showcasing its practicality and broad applicability.
Pages: 15