Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

Cited by: 22
Authors
Cui, Weihao [1 ]
Wei, Mengze [1 ]
Chen, Quan [1 ]
Tang, Xiaoxin [2 ]
Leng, Jingwen [1 ]
Li, Li [1 ]
Guo, Mingyi [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept CSE, Shanghai, Peoples R China
[2] Shanghai Univ Finance & Econ, Shanghai, Peoples R China
Source
2019 IEEE 37TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2019) | 2019
Funding
National Natural Science Foundation of China;
关键词
GPUs; DL Serving; Latency; Throughput; Responsiveness; NONPREEMPTIVE ACCELERATORS;
DOI
10.1109/ICCD46524.2019.00075
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812 ;
Abstract
GPUs have been widely adopted to serve online deep learning-based services that have stringent QoS requirements. However, emerging deep learning serving systems often suffer long latency and low throughput for inference requests, which damages user experience and increases the number of GPUs required to host an online service. Our investigation shows that poor batching and the lack of data transfer-computation overlap are the root causes of the long latency and low throughput. To this end, we propose Ebird, a deep learning serving system that comprises a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates unnecessary waiting in the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving GPU resource utilization. The batch scheduler organizes inference requests elastically. Our experimental results on an Nvidia Titan RTX GPU show that Ebird reduces inference response latency by up to 70.9% and improves throughput by up to 49.3% compared with TensorFlow Serving, while guaranteeing the QoS target.
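The elastic batching idea described in the abstract can be sketched as follows. This is a hypothetical illustration under our own assumptions, not Ebird's actual implementation: instead of blocking until a fixed batch size is reached (as fixed-size batching does), an elastic scheduler dispatches whatever requests are currently pending, up to a maximum batch size, trading a little per-batch efficiency for lower queueing delay.

```python
def elastic_batches(requests, max_batch=8):
    """Group pending requests into variable-sized batches.

    Unlike fixed-size batching, this never waits for a full batch:
    everything pending (up to max_batch) is dispatched at once,
    so a lone request is not stalled waiting for companions.
    The names and max_batch value here are illustrative only.
    """
    pending = list(requests)
    batches = []
    while pending:
        take = min(len(pending), max_batch)  # dispatch what we have now
        batches.append(pending[:take])
        pending = pending[take:]
    return batches

# 11 pending requests -> one batch of 8, then the 3 leftovers
# are dispatched immediately rather than waiting for 5 more.
print(elastic_batches(range(11)))
```

In a real serving system the loop would run against a live request queue, and batches of different sizes would execute concurrently on the GPU to overlap data transfer with computation, as the abstract describes.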
Pages: 497-505 (9 pages)
相关论文
共 24 条
  • [1] Abadi M, 2016, PROCEEDINGS OF OSDI'16: 12TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, P265
  • [2] [Anonymous], 2019, NVIDIA TURING ARCHIT
  • [3] [Anonymous], 2019, TENSORFLOW SERVING B
  • [4] [Anonymous], 2014, ARXIV NEURAL EVOLUTI
  • [5] [Anonymous], TECHNICAL REPORT
  • [6] [Anonymous], ARXIV151206216
  • [7] Awan AA, 2017, ACM SIGPLAN NOTICES, V52, P193, DOI [10.1145/3018743.3018769, 10.1145/3155284.3018769]
  • [8] Chen Q, 2017, OPER SYST REV, V51, P17, DOI 10.1145/3037697.3037700
  • [9] Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers
    Chen, Quan
    Yang, Hailong
    Mars, Jason
    Tang, Lingjia
    [J]. ACM SIGPLAN NOTICES, 2016, 51 (04) : 681 - 696
  • [10] Crankshaw D, 2017, PROCEEDINGS OF NSDI '17: 14TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, P613