Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

Cited by: 5
Authors
Hu, Zhiyao [1 ]
Li, Dongsheng [2 ]
Zhang, Dongxiang [4 ]
Zhang, Yiming [3 ]
Peng, Baoyun [1 ]
Affiliations
[1] Natl Univ Def Technol, Changsha 410073, Peoples R China
[2] Natl Univ Def Technol, Coll Comp, Comp Sci, Changsha 410073, Peoples R China
[3] Natl Univ Def Technol, Sch Comp, Changsha 410073, Peoples R China
[4] Zhejiang Univ, Hangzhou 310027, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
Sparks; Resource management; Predictive models; Training; Task analysis; Transfer learning; Adaptation models; Data-parallel job; resource allocation; performance prediction; sampling overhead;
DOI
10.1109/TPDS.2021.3055019
CLC Number
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Under-allocating or over-allocating computation resources (e.g., CPU cores) can prolong the completion time of data-parallel jobs in a distributed system. We present a predictor, ReLocag, that finds the near-optimal number of CPU cores to minimize job completion time (JCT). ReLocag combines a graph convolutional network (GCN) with a fully-connected neural network (FCNN). The GCN learns the dependencies between operations from the workflow of a job, and the FCNN takes the resulting workflow representation together with other features (e.g., the input size, the number of CPU cores, the amount of memory, and the number of computation tasks) as input for JCT prediction. The prediction result guides the user in selecting the near-optimal number of CPU cores. In addition, we propose two strategies to mitigate the high cost of training-sample collection in big data applications. First, we develop an adaptive sampling method that judiciously collects only essential samples. Second, we design a cross-application transfer learning model that exploits training samples collected from other applications. We conduct extensive experiments in a Spark cluster on seven representative types of Spark applications. Results show that ReLocag improves JCT prediction accuracy by 4-14 percent and reduces CPU core consumption by 58.2 percent.
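To make the described architecture concrete, the sketch below (not the authors' code) shows a GCN encoding a job's operation DAG and an FCNN combining the pooled graph embedding with the scalar features named in the abstract to predict JCT. The class names, layer sizes, mean-pooling step, and degree normalization are illustrative assumptions.

```python
# Minimal sketch of a ReLocag-style predictor, assuming a PyTorch setting.
# A GCN encodes the workflow DAG; an FCNN maps [graph embedding, scalar
# features] -> predicted JCT. All hyperparameters are placeholders.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: mean-aggregate neighbor features over the
    self-loop-augmented adjacency matrix, then apply a learned projection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, x):
        # adj: (n, n) adjacency of the job's operation DAG; x: (n, in_dim)
        a_hat = adj + torch.eye(adj.size(0))   # add self-loops
        deg = a_hat.sum(dim=1, keepdim=True)   # per-node degree (>= 1)
        x = (a_hat / deg) @ x                  # normalized neighbor average
        return torch.relu(self.linear(x))

class JCTPredictor(nn.Module):
    def __init__(self, node_dim, scalar_dim, hidden=64):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.fcnn = nn.Sequential(
            nn.Linear(hidden + scalar_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),              # predicted JCT
        )

    def forward(self, adj, node_feats, scalar_feats):
        h = self.gcn2(adj, self.gcn1(adj, node_feats))
        graph_emb = h.mean(dim=0)              # pool node embeddings
        return self.fcnn(torch.cat([graph_emb, scalar_feats]))

# Usage: sweep the CPU-core feature and keep the core count with the lowest
# predicted JCT, mirroring how the abstract says predictions guide allocation.
model = JCTPredictor(node_dim=8, scalar_dim=4)
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 4] = 1.0   # toy 5-stage DAG
nodes = torch.randn(5, 8)
# scalar features: [input size (GB), CPU cores, memory (GB), task count]
preds = {c: model(adj, nodes, torch.tensor([1.0, float(c), 4.0, 100.0])).item()
         for c in (8, 16, 32, 64)}
best_cores = min(preds, key=preds.get)
```

Separating the graph encoder from the scalar-feature head keeps the workflow representation reusable across candidate core counts; only the cheap FCNN pass must be repeated during the sweep.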
Pages: 2188-2201
Number of pages: 14