Modern machine learning inference systems often host multiple models that can perform the same task with different levels of accuracy and latency. For example, a large model may be more accurate but slow, whereas a smaller, less accurate model can serve inference queries faster. Amid the rapid advancement of Large Language Models (LLMs), it is paramount for such systems to strike the best trade-off between latency and accuracy. In this paper, we consider the problem of designing job assignment policies for a multi-server queueing system in which servers have heterogeneous service rates and accuracies, with the goal of minimizing the expected inference latency while meeting an average accuracy target. To the best of our knowledge, such constrained queueing systems have been sparsely studied in the prior literature. We first establish a lower bound on the minimum achievable latency under any policy that meets the target accuracy $a^*$ using a linear programming (LP) formulation. Building on the LP solution, we introduce a Randomized Join-the-Idle-Queue (R-JIQ) policy, which consistently meets the accuracy target and asymptotically (as the system size increases) achieves the optimal latency $T_{\text{LP-LB}}(\lambda)$. However, the R-JIQ policy relies on knowledge of the arrival rate $\lambda$ to solve the LP. To address this limitation, we propose the Prioritize Ordered Pairs (POP) policy, which incorporates the concept of ordered pairs of servers into a waterfilling procedure to iteratively solve the LP. This allows the POP policy to operate without knowledge of the arrival rate. Experiments suggest that POP performs robustly across different system sizes and load scenarios, achieving near-optimal performance.
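
As a rough illustration of the pipeline described above, the following Python snippet pairs a toy LP over per-server routing fractions with an R-JIQ-style dispatch rule that samples an idle server according to those fractions. This is a minimal sketch under assumed modeling choices: the objective, constraint set, and all names (`lp_routing_fractions`, `rjiq_dispatch`, `lam`, `mu`, `acc`, `a_star`) are illustrative assumptions, not the paper's actual formulation or implementation.

```python
# Hypothetical sketch: an LP over routing fractions plus R-JIQ-style dispatch.
# All modeling choices below are assumptions made for illustration only.
import numpy as np
from scipy.optimize import linprog


def lp_routing_fractions(lam, mu, acc, a_star):
    """Solve a toy LP for routing fractions p over n heterogeneous servers.

    Assumed constraints:
      sum_i p_i = 1                 (all traffic is routed)
      sum_i p_i * acc_i >= a_star   (average accuracy target)
      lam * p_i <= mu_i             (per-server stability)
    The objective sum_i p_i / mu_i is a linear stand-in for latency, not the
    paper's T_{LP-LB}(lambda).
    """
    mu, acc = np.asarray(mu, float), np.asarray(acc, float)
    n = len(mu)
    c = 1.0 / mu                                      # prefer faster servers
    A_ub = np.vstack([-acc[None, :], lam * np.eye(n)])
    b_ub = np.concatenate([[-a_star], mu])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, 1)] * n)
    return res.x


def rjiq_dispatch(p, idle, rng):
    """Sample a destination server: prefer idle servers, weighted by p;
    if no idle server carries positive weight, sample over all servers."""
    w = np.where(np.asarray(idle, bool), p, 0.0)
    if w.sum() == 0.0:
        w = np.asarray(p, float)
    return rng.choice(len(p), p=w / w.sum())
```

For instance, with two servers, `lp_routing_fractions(lam=4.0, mu=[3.0, 6.0], acc=[0.95, 0.80], a_star=0.9)` routes roughly two-thirds of the traffic to the accurate-but-slow server, the minimum needed to meet the 0.9 accuracy target, and `rjiq_dispatch` then uses these fractions to pick among currently idle servers on each arrival.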