Probabilistic Reservation Services for Large-Scale Batch-Scheduled Systems

Cited by: 3
Authors
Nurmi, Daniel [1]
Wolski, Rich [1]
Brevik, John [2]
Affiliations
[1] Univ Calif Santa Barbara, Dept Comp Sci, Santa Barbara, CA 93106 USA
[2] Calif State Univ Long Beach, Dept Math & Stat, Long Beach, CA 90840 USA
Source
IEEE SYSTEMS JOURNAL | 2009 / Vol. 3 / Issue 01
DOI
10.1109/JSYST.2008.2011303
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
In high-performance computing (HPC) settings, in which multiprocessor machines are shared among users with potentially competing resource demands, processors are allocated to user workload using space sharing. Typically, users interact with a given machine by submitting their jobs to a centralized batch scheduler that implements a site-specific, and often partially hidden, policy designed to maximize machine utilization while providing tolerable turnaround times. In practice, while most HPC systems experience good utilization levels, the amount of time individual jobs spend waiting to begin execution has been shown to be highly variable and difficult to predict, leading to user confusion and/or frustration. One proposed method for dealing with this uncertainty is to predict the amount of time that individual jobs will wait in batch queues once they are submitted, thus allowing a user to reason about the total time between job submission and job completion (which we term a job's "overall turnaround time"). Another related but distinct method for handling the uncertainty is to allow users who are willing to plan ahead to make "advanced reservations" for processor resources, again allowing them to reason about job turnaround time. To date, however, few if any HPC centers provide either job-queue delay prediction services or advanced reservation capabilities to their general user populations. In this paper, we describe QBETS, VARQ, and CO-VARQ, new methods for allowing users to reason about and control the overall turnaround time of their batch-queue jobs submitted to busy HPC systems in existence today. QBETS is an online, non-parametric system for predicting statistical bounds on the amount of time individual batch jobs will wait in queue. VARQ is a new method for job scheduling that provides users with probabilistic "virtual" advanced reservations using only existing best-effort batch schedulers and policies, and CO-VARQ utilizes this capability to implement a general coallocation service. QBETS, VARQ, and CO-VARQ operate as overlays, requiring no modification to the local scheduler implementation or policies. We describe the statistical methods we use to implement the systems, detail empirical evaluations of their effectiveness in a number of HPC settings, and explore the potential future impact of these systems should they become widely used.
Pages: 6-24
Page count: 19