CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU-GPU system

Cited by: 1
Authors
Zhang, Qi [1]
Liu, Yi [1]
Liu, Tao [2]
Qian, Depei [1]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, 37 Xueyuan Rd, Beijing 100190, Peoples R China
[2] Shandong Prov Key Lab Comp Networks, 28666 Jingshi Dong Lu, Jinan 250103, Shandong, Peoples R China
Keywords
Deep learning; Inference; Quality of service; Tail latency; GPU
DOI
10.1007/s11227-023-05183-6
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent years have witnessed significant achievements in deep learning (DL) technologies, and a growing number of online service operators use deep learning to provide intelligent, personalized services. Although significant effort has gone into optimizing inference efficiency, our investigation shows that for many DL models that process data-intensive requests, the network I/O subsystem also plays an essential role in determining responsiveness. Furthermore, under a latency constraint, uncontrolled network flow processing disrupts request batching. Based on these observations, this paper proposes CoFB, an inference service system that optimizes performance holistically. CoFB mitigates load imbalance in the network I/O subsystem with a lightweight flow scheduling scheme in which the network interface card cooperates with a dispatcher thread. In addition, CoFB introduces a request reordering and batching policy and an interference-aware strategy for throttling concurrent batches, so that inference completes within its deadline. We evaluate CoFB on four DL inference services and compare it with two state-of-the-art inference systems, NVIDIA Triton and DVABatch. Experimental results show that, under preset tail latency objectives, CoFB serves up to 2.69x higher load than Triton and up to 1.96x higher load than DVABatch.
Pages: 14172-14199
Page count: 28