Exploring Plan-Based Scheduling for Large-Scale Computing Systems

Cited by: 9
Authors
Zheng, Xingwu [1]
Zhou, Zhou [2]
Yang, Xu [2]
Lan, Zhiling [2]
Wang, Jia [1]
Affiliations
[1] IIT, Dept Elect & Comp Engn, Chicago, IL 60616 USA
[2] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Source
2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2016
Keywords
Plan-based scheduling; Simulated Annealing algorithm; Optimization;
DOI
10.1109/CLUSTER.2016.43
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
As HPC systems scale toward exascale, it becomes critical to manage the underlying resources more effectively. Almost all existing resource management systems schedule jobs in a queuing fashion and make isolated scheduling decisions that can compromise system performance, even with backfilling. Plan-based schedulers have the potential to generate better job schedules by producing an execution plan for all waiting jobs, but they have not received enough attention. In this paper, we present a novel plan-based scheduling system that uses simulated annealing as the optimization engine to support effective resource management on HPC systems. As demonstrated by extensive trace-based simulations with workload traces collected from a wide range of production supercomputers, our plan-based scheduling system, compared with a queue-based scheduling system using FCFS with EASY backfilling, can reduce job wait time by 40% and job response time by 30% while slightly improving system utilization. Moreover, our plan-based system can run online, solving the scheduling problem at each scheduling iteration within one second, which makes it practical for production HPC systems.
Pages: 259-268
Number of pages: 10
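
The abstract describes a scheduler that builds an execution plan for all waiting jobs and improves it with simulated annealing. The following is a minimal sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the job model (a `(nodes, runtime)` pair, all jobs already waiting at time 0), the objective (average wait time), the pairwise-swap neighborhood, the geometric cooling schedule, and the hypothetical names `plan_wait_time` and `anneal`.

```python
import heapq
import math
import random


def plan_wait_time(order, jobs, total_nodes):
    """Average wait time when jobs start in the given order.

    Assumption: every job is already waiting at time 0, so a job's wait
    time equals its start time. Each job starts at the earliest time, no
    earlier than the previous job's start, at which enough nodes are free.
    """
    running = []          # min-heap of (end_time, nodes) for started jobs
    free = total_nodes
    now = 0.0
    total_wait = 0.0
    for j in order:
        nodes, runtime = jobs[j]
        if nodes > total_nodes:
            raise ValueError("job requests more nodes than the system has")
        # Advance time until enough nodes have been released.
        while free < nodes:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        total_wait += now
        heapq.heappush(running, (now + runtime, nodes))
        free -= nodes
    return total_wait / len(order)


def anneal(jobs, total_nodes, iters=2000, t0=10.0, cooling=0.995, seed=0):
    """Search job orderings with simulated annealing (illustrative only)."""
    rng = random.Random(seed)
    order = list(range(len(jobs)))            # start from the FCFS order
    cost = plan_wait_time(order, jobs, total_nodes)
    best_order, best_cost = order[:], cost
    if len(order) < 2:
        return best_order, best_cost
    temp = t0
    for _ in range(iters):
        # Neighbor plan: swap two randomly chosen jobs.
        i, k = rng.sample(range(len(order)), 2)
        order[i], order[k] = order[k], order[i]
        new_cost = plan_wait_time(order, jobs, total_nodes)
        delta = new_cost - cost
        # Always accept improvements; accept worse plans with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[k] = order[k], order[i]   # undo the swap
        temp *= cooling
    return best_order, best_cost


if __name__ == "__main__":
    # Hypothetical waiting queue: (nodes requested, runtime) per job.
    queue = [(8, 100.0), (2, 5.0), (2, 5.0), (4, 10.0)]
    fcfs = plan_wait_time(list(range(len(queue))), queue, total_nodes=8)
    order, cost = anneal(queue, total_nodes=8)
    print("FCFS average wait:", fcfs)
    print("Annealed plan:", order, "average wait:", cost)
```

Because the search starts from the arrival order and keeps the best plan seen, the returned plan is never worse than strict in-order execution under this toy model. A production scheduler would also have to handle job arrivals over time, runtime estimates, and a time budget per scheduling iteration, none of which this sketch covers.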