Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

Cited by: 22
Authors
Cui, Weihao [1 ]
Wei, Mengze [1 ]
Chen, Quan [1 ]
Tang, Xiaoxin [2 ]
Leng, Jingwen [1 ]
Li, Li [1 ]
Guo, Mingyi [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept CSE, Shanghai, Peoples R China
[2] Shanghai Univ Finance & Econ, Shanghai, Peoples R China
Source
2019 IEEE 37TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2019) | 2019
Funding
National Natural Science Foundation of China;
关键词
GPUs; DL Serving; Latency; Throughput; Responsiveness; NONPREEMPTIVE ACCELERATORS;
DOI
10.1109/ICCD46524.2019.00075
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812 ;
Abstract
GPUs have been widely adopted to serve online deep learning-based services that have stringent QoS requirements. However, emerging deep learning serving systems often suffer long latency and low throughput for inference requests, which damages user experience and increases the number of GPUs required to host an online service. Our investigation shows that poor batching and the lack of data transfer-computation overlap are the root causes of the long latency and low throughput. To this end, we propose Ebird, a deep learning serving system that comprises a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler. The memory pool eliminates unnecessary waiting in the batching operation and enables data transfer-computation overlap. The inference engine enables concurrent execution of different batches, improving GPU resource utilization. The batch scheduler organizes inference requests elastically. Our experimental results on an Nvidia Titan RTX GPU show that Ebird reduces inference response latency by up to 70.9% and improves throughput by up to 49.3% compared with TensorFlow Serving, while guaranteeing the QoS target.
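The elastic batching idea described in the abstract can be sketched as follows. This is a hypothetical illustration under our own assumptions, not Ebird's actual implementation: instead of blocking until a fixed batch size is reached (as fixed-size batching does), an elastic scheduler dispatches whatever requests are currently pending, up to a maximum batch size, trading a little per-batch efficiency for lower queueing delay.

```python
def elastic_batches(requests, max_batch=8):
    """Group pending requests into variable-sized batches.

    Unlike fixed-size batching, this never waits for a full batch:
    everything pending (up to max_batch) is dispatched at once,
    so a lone request is not stalled waiting for companions.
    The names and max_batch value here are illustrative only.
    """
    pending = list(requests)
    batches = []
    while pending:
        take = min(len(pending), max_batch)  # dispatch what we have now
        batches.append(pending[:take])
        pending = pending[take:]
    return batches

# 11 pending requests -> one batch of 8, then the 3 leftovers
# are dispatched immediately rather than waiting for 5 more.
print(elastic_batches(range(11)))
```

In a real serving system the loop would run against a live request queue, and batches of different sizes would execute concurrently on the GPU to overlap data transfer with computation, as the abstract describes.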
Pages: 497-505 (9 pages)
相关论文
共 24 条
  • [1] Abadi M, 2016, PROCEEDINGS OF OSDI'16: 12TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, P265
  • [2] [Anonymous], 2019, NVIDIA TURING ARCHIT
  • [3] [Anonymous], 2019, TENSORFLOW SERVING B
  • [4] [Anonymous], 2014, ARXIV NEURAL EVOLUTI
  • [5] [Anonymous], TECHNICAL REPORT
  • [6] [Anonymous], ARXIV151206216
  • [7] Awan AA, 2017, ACM SIGPLAN NOTICES, V52, P193, DOI [10.1145/3018743.3018769, 10.1145/3155284.3018769]
  • [8] Chen Q, 2017, OPER SYST REV, V51, P17, DOI 10.1145/3037697.3037700
  • [9] Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers
    Chen, Quan
    Yang, Hailong
    Mars, Jason
    Tang, Lingjia
    [J]. ACM SIGPLAN NOTICES, 2016, 51 (04) : 681 - 696
  • [10] Crankshaw D, 2017, PROCEEDINGS OF NSDI '17: 14TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION, P613