Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

Cited by: 19
Authors
Yang, Renyu [1]
Hu, Chunming [2]
Sun, Xiaoyang [1]
Garraghan, Peter [6]
Wo, Tianyu [3, 4]
Wen, Zhenyu [7]
Peng, Hao [5]
Xu, Jie [1]
Li, Chao [8]
Affiliations
[1] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
[2] Beihang Univ, Sch Comp Sci & Engn, Beijing 100083, Peoples R China
[3] Beihang Univ, Sch Comp, Beijing 100083, Peoples R China
[4] Beihang Univ, State Key Lab Software Dev Environm, Beijing 100083, Peoples R China
[5] Beihang Univ, Beijing Adv Innovat Ctr Big Data & Brain Comp, Beijing 100083, Peoples R China
[6] Univ Lancaster, Sch Comp & Commun, Lancaster LA1 4YW, England
[7] Newcastle Univ, Sch Comp, Newcastle Upon Tyne NE1 7RU, Tyne & Wear, England
[8] Alibaba Grp, Engn, Hangzhou 310052, Peoples R China
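Funding
UK Engineering and Physical Sciences Research Council (EPSRC); National Key Research and Development Program of China;
Keywords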
Resource scheduling; oversubscription; cluster utilization; resource throttling; QoS;
DOI
10.1109/TPDS.2020.2970013
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Achieving a high degree of resource utilization is a long-standing challenge in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this article we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. Node agents on those machines can, however, avoid excessive resource oversubscription through admission control based on multi-resource threshold control and performance-aware resource throttling. Experiments show that, with mixed co-location of batch jobs and latency-sensitive LRAs, CPU utilization and disk utilization reach 56.34 and 43.49 percent, respectively, while the 95th percentile of read latency in YCSB workloads increases by only 5.4 percent compared with executing the LRAs alone.
Pages: 1499-1517
Number of pages: 19