E2bird: Enhanced Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

Cited by: 16
Authors
Cui, Weihao [1 ]
Chen, Quan [1 ,2 ]
Zhao, Han [1 ]
Wei, Mengze [1 ]
Tang, Xiaoxin [3 ]
Guo, Minyi [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai Inst Adv Commun & Data Sci, Shanghai 200240, Peoples R China
[3] Shanghai Univ Finance & Econ, Dept Comp Sci, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
GPUs; DL serving; latency; throughput; responsiveness;
DOI
10.1109/TPDS.2020.3047638
CLC Number
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
We tackle the problems of deep learning serving on GPUs from a system perspective. GPUs are widely adopted to serve online deep learning-based services with stringent QoS (Quality-of-Service) requirements. However, existing deep learning serving systems often suffer from poor responsiveness and low inference throughput, which damage the user experience and increase the number of GPUs required to host an online service. Our investigation shows that inefficient batching and the lack of data transfer-computation overlap are the root causes of the poor responsiveness and low throughput. To this end, we propose E^2bird, a deep learning serving system comprising a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates unnecessary waiting in the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving GPU resource utilization. The batch scheduler organizes inferences elastically to guarantee the QoS. Experimental results on an Nvidia Titan RTX GPU show that, compared with TensorFlow Serving, E^2bird reduces the response latency of inferences by up to 82.4 percent and improves throughput by up to 62.8 percent while guaranteeing the QoS target.
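A minimal sketch of the elastic batching idea described in the abstract: requests are grouped into a batch only while the QoS deadline of the oldest queued request still allows, so no inference waits for a full batch to form. The Request class, queue layout, and the 50 ms / 32-request limits below are illustrative assumptions, not details taken from the paper.

import time
from dataclasses import dataclass, field
from queue import Empty, Queue
from typing import List

QOS_LATENCY_S = 0.050   # assumed end-to-end latency target per request
MAX_BATCH = 32          # assumed upper bound on batch size

@dataclass
class Request:
    payload: object
    arrival: float = field(default_factory=time.time)

def form_elastic_batch(pending: "Queue[Request]") -> List[Request]:
    """Block for the first request, then admit more only while its QoS deadline holds."""
    first = pending.get()                    # wait for at least one request
    batch = [first]
    deadline = first.arrival + QOS_LATENCY_S
    while len(batch) < MAX_BATCH and time.time() < deadline:
        try:
            batch.append(pending.get_nowait())
        except Empty:
            break                            # queue drained: run a partial batch instead of waiting
    return batch

A full serving system would then launch the batch on the GPU and overlap host-to-device copies with computation, which is the role the abstract assigns to the GPU-resident memory pool and the multi-granularity inference engine.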
Pages: 1307-1321
Number of pages: 15