CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system

Cited by: 1
Authors
Zhang, Qi [1 ]
Liu, Yi [1 ]
Liu, Tao [2 ]
Qian, Depei [1 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, 37 Xueyuan Rd, Beijing 100190, Peoples R China
[2] Shandong Prov Key Lab Comp Networks, 28666 Jingshi Dong Lu, Jinan 250103, Shandong, Peoples R China
Keywords
Deep learning; Inference; Quality of service; Tail latency; GPU
DOI
10.1007/s11227-023-05183-6
CLC classification
TP3 [Computing Technology; Computer Technology]
Subject classification code
0812
Abstract
Recent years have witnessed significant achievements in deep learning (DL) technologies, and a growing number of online service operators use deep learning to provide intelligent, personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models that process data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing degrades request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme that coordinates the network interface card with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy, together with an interference-aware concurrent batch throttling strategy, to keep inference within the deadline. We evaluate CoFB on four DL inference services and compare it to two state-of-the-art inference systems, NVIDIA Triton and DVABatch. Experimental results show that CoFB outperforms these baselines, serving up to 2.69x and 1.96x higher load, respectively, under preset tail latency objectives.
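The abstract only names the mechanisms; as a concrete illustration of what a deadline-aware reorder-and-batch policy can look like, the following is a minimal Python sketch. All names (Request, DeadlineAwareBatcher), the earliest-deadline-first ordering, and the profiled latency model are assumptions made for illustration, not CoFB's actual algorithm.

```python
import heapq
import time

class Request:
    """An inference request tagged with its latency SLO at arrival."""

    def __init__(self, req_id, payload, slo_ms):
        self.req_id = req_id
        self.payload = payload
        # Absolute deadline derived from the per-request SLO.
        self.deadline = time.monotonic() + slo_ms / 1000.0

    def __lt__(self, other):
        # Earliest-deadline-first ordering inside the heap.
        return self.deadline < other.deadline


class DeadlineAwareBatcher:
    """Reorders pending requests by deadline and grows a batch only while
    the batch's estimated execution time still fits the tightest deadline."""

    def __init__(self, max_batch_size, est_latency_ms):
        self.pending = []                      # min-heap keyed by deadline
        self.max_batch_size = max_batch_size
        self.est_latency_ms = est_latency_ms   # profiled latency per batch size

    def enqueue(self, request):
        heapq.heappush(self.pending, request)

    def next_batch(self):
        batch = []
        now = time.monotonic()
        while self.pending and len(batch) < self.max_batch_size:
            est = self.est_latency_ms(len(batch) + 1) / 1000.0
            # Stop growing once one more request would push completion past
            # the earliest remaining deadline; always take at least one
            # request so the queue cannot stall.
            if batch and now + est > self.pending[0].deadline:
                break
            batch.append(heapq.heappop(self.pending))
        return batch


# Example: assume batch latency grows roughly linearly with batch size.
batcher = DeadlineAwareBatcher(max_batch_size=8,
                               est_latency_ms=lambda b: 5.0 + 2.0 * b)
for i in range(4):
    batcher.enqueue(Request(i, payload=None, slo_ms=50.0))
print([r.req_id for r in batcher.next_batch()])  # requests in deadline order
```

The property this sketch mirrors from the abstract is that batching is bounded by the latency objective: the batch stops growing as soon as the estimated batch latency would violate the tightest pending deadline.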
Pages: 14172-14199
Page count: 28