CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system

Cited by: 1
Authors
Zhang, Qi [1 ]
Liu, Yi [1 ]
Liu, Tao [2 ]
Qian, Depei [1 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, 37 Xueyuan Rd, Beijing 100190, Peoples R China
[2] Shandong Prov Key Lab Comp Networks, 28666 Jingshi Dong Lu, Jinan 250103, Shandong, Peoples R China
Keywords
Deep learning; Inference; Quality of service; Tail latency; GPU
DOI
10.1007/s11227-023-05183-6
CLC classification
TP3 [Computing Technology; Computer Technology]
Subject classification code
0812
Abstract
Recent years have witnessed significant achievements in deep learning (DL) technologies, and a growing number of online service operators use deep learning to provide intelligent, personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models that process data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing degrades request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme that coordinates the network interface card with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy, together with an interference-aware concurrent batch throttling strategy, to keep inference within the deadline. We evaluate CoFB on four DL inference services and compare it to two state-of-the-art inference systems, NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these baselines, serving up to 2.69x and 1.96x higher load, respectively, under preset tail latency objectives.
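The abstract only names the mechanisms; as a concrete illustration of what a deadline-aware reorder-and-batch policy can look like, the following is a minimal Python sketch. All names (Request, DeadlineAwareBatcher), the earliest-deadline-first ordering, and the profiled latency model are assumptions made for illustration, not CoFB's actual algorithm.

```python
import heapq
import time

class Request:
    """An inference request tagged with its latency SLO at arrival."""

    def __init__(self, req_id, payload, slo_ms):
        self.req_id = req_id
        self.payload = payload
        # Absolute deadline derived from the per-request SLO.
        self.deadline = time.monotonic() + slo_ms / 1000.0

    def __lt__(self, other):
        # Earliest-deadline-first ordering inside the heap.
        return self.deadline < other.deadline


class DeadlineAwareBatcher:
    """Reorders pending requests by deadline and grows a batch only while
    the batch's estimated execution time still fits the tightest deadline."""

    def __init__(self, max_batch_size, est_latency_ms):
        self.pending = []                      # min-heap keyed by deadline
        self.max_batch_size = max_batch_size
        self.est_latency_ms = est_latency_ms   # profiled latency per batch size

    def enqueue(self, request):
        heapq.heappush(self.pending, request)

    def next_batch(self):
        batch = []
        now = time.monotonic()
        while self.pending and len(batch) < self.max_batch_size:
            est = self.est_latency_ms(len(batch) + 1) / 1000.0
            # Stop growing once one more request would push completion past
            # the earliest remaining deadline; always take at least one
            # request so the queue cannot stall.
            if batch and now + est > self.pending[0].deadline:
                break
            batch.append(heapq.heappop(self.pending))
        return batch


# Example: assume batch latency grows roughly linearly with batch size.
batcher = DeadlineAwareBatcher(max_batch_size=8,
                               est_latency_ms=lambda b: 5.0 + 2.0 * b)
for i in range(4):
    batcher.enqueue(Request(i, payload=None, slo_ms=50.0))
print([r.req_id for r in batcher.next_batch()])  # requests in deadline order
```

The property this sketch mirrors from the abstract is that batching is bounded by the latency objective: the batch stops growing as soon as the estimated batch latency would violate the tightest pending deadline.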
Pages: 14172-14199
Page count: 28