Reinforcement-Learning-Empowered MLaaS Scheduling for Serving Intelligent Internet of Things

Cited by: 12
Authors
Qin, Heyang [1 ]
Zawad, Syed [1 ]
Zhou, Yanqi [2 ]
Padhi, Sanjay [3 ]
Yang, Lei [1 ]
Yan, Feng [1 ]
Affiliations
[1] Univ Nevada, Dept Comp Sci & Engn, Reno, NV 89557 USA
[2] Google, Google Brain, Mountain View, CA 94043 USA
[3] Amazon Web Serv, US Educ, Seattle, WA 98109 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
Parallel processing; Internet of Things; Machine learning; Computational modeling; Graphics processing units; Cloud computing; Dynamic scheduling; Internet of Things (IoT); machine-learning-as-a-service (MLaaS); model inference; parallelism parameter tuning; reinforcement learning; service-level-objective (SLO); workload scheduling;
DOI
10.1109/JIOT.2020.2965103
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Discipline Classification Code
0812
Abstract
Machine learning (ML) has been embedded in many Internet of Things (IoT) applications (e.g., smart home and autonomous driving). Yet it is often infeasible to deploy ML models on IoT devices due to resource limitations, so deploying trained ML models in the cloud and providing inference services to IoT devices becomes a plausible solution. To provide low-latency ML serving to massive numbers of IoT devices, a natural and promising approach is to exploit parallelism in computation. However, existing ML systems (e.g., TensorFlow) and cloud ML-serving platforms (e.g., SageMaker) are service-level-objective (SLO) agnostic and rely on users to manually configure parallelism at both the request and operation levels. To address this challenge, we propose a region-based reinforcement learning (RRL) scheduling framework for ML serving in IoT applications that can efficiently identify optimal configurations under dynamic workloads. A key observation is that the system performance under similar configurations within a region can be accurately estimated from the performance under one of those configurations, owing to their correlation. We show theoretically that the RRL approach achieves fast convergence at the cost of some performance loss. To improve performance, we propose an adaptive RRL algorithm based on Bayesian optimization that balances convergence speed and optimality. The proposed framework is prototyped and evaluated on the TensorFlow Serving system. Extensive experimental results show that the proposed approach outperforms state-of-the-art approaches, finding near-optimal solutions more than eight times faster while reducing inference latency by up to 88.9% and SLO violations by up to 91.6%.
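The abstract's key observation, that performance under similar configurations within a region can be estimated from a single sampled configuration, can be illustrated with a minimal sketch. The code below is not the paper's RRL algorithm; it is a simplified, bandit-style region search over two hypothetical parallelism knobs (inter-op and intra-op thread counts), with a synthetic latency function standing in for real profiling on a serving system. All names, ranges, and the latency model are illustrative assumptions.

import random

# Hypothetical parallelism configuration space: (inter-op threads, intra-op threads).
# In the paper's setting these would be request- and operation-level parallelism
# knobs of the serving system; the latency model below is only a stand-in.
INTER_OP = list(range(1, 17))
INTRA_OP = list(range(1, 17))

def measure_latency(inter_op, intra_op):
    """Placeholder for profiling one configuration (synthetic, smooth surface)."""
    return abs(inter_op - 6) * 3.0 + abs(intra_op - 10) * 2.0 + random.gauss(0, 0.5)

def region_of(inter_op, intra_op, region_size):
    """Map a configuration to its region; configurations in a region are 'similar'."""
    return (inter_op // region_size, intra_op // region_size)

def region_based_search(region_size=4, episodes=30, epsilon=0.2):
    """Bandit-style search that scores whole regions from single samples."""
    region_value = {}                 # estimated latency per region
    best = (None, float("inf"))
    for _ in range(episodes):
        if region_value and random.random() > epsilon:
            # Exploit: pick a configuration from the best-looking region.
            target = min(region_value, key=region_value.get)
            candidates = [(i, j) for i in INTER_OP for j in INTRA_OP
                          if region_of(i, j, region_size) == target]
            cfg = random.choice(candidates)
        else:
            # Explore: sample a random configuration.
            cfg = (random.choice(INTER_OP), random.choice(INTRA_OP))
        latency = measure_latency(*cfg)
        r = region_of(*cfg, region_size)
        # One measurement updates the estimate for the entire region.
        old = region_value.get(r, latency)
        region_value[r] = 0.5 * old + 0.5 * latency
        if latency < best[1]:
            best = (cfg, latency)
    return best

if __name__ == "__main__":
    cfg, latency = region_based_search()
    print(f"best config {cfg} with latency {latency:.2f} ms")

Scoring an entire region from one measurement is what allows this kind of search to converge with few profiling runs; per the abstract, the paper's adaptive variant additionally uses Bayesian optimization to balance convergence speed against optimality.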
Pages: 6325-6337 (13 pages)