Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

Cited: 6
Authors
Hu, Zhiyao [1 ]
Li, Dongsheng [2 ]
Zhang, Dongxiang [4 ]
Zhang, Yiming [3 ]
Peng, Baoyun [1 ]
Affiliations
[1] Natl Univ Def Technol, Changsha 410073, Peoples R China
[2] Natl Univ Def Technol, Coll Comp, Comp Sci, Changsha 410073, Peoples R China
[3] Natl Univ Def Technol, Sch Comp, Changsha 410073, Peoples R China
[4] Zhejiang Univ, Hangzhou 310027, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
Sparks; Resource management; Predictive models; Training; Task analysis; Transfer learning; Adaptation models; Data-parallel job; resource allocation; performance prediction; sampling overhead
DOI
10.1109/TPDS.2021.3055019
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
Under-allocating or over-allocating computation resources (e.g., CPU cores) can prolong the completion time of data-parallel jobs in a distributed system. We present a predictor, ReLocag, that finds the near-optimal number of CPU cores to minimize job completion time (JCT). ReLocag combines a graph convolutional network (GCN) with a fully-connected neural network (FCNN). The GCN learns the dependencies between operations from a job's workflow, and the FCNN takes the resulting workflow-dependency representation together with other features (e.g., the input size, the number of CPU cores, the amount of memory, and the number of computation tasks) as input to predict JCT. The prediction then guides the user in choosing the near-optimal number of CPU cores. In addition, we propose two effective strategies to reduce the high cost of collecting training samples in big data applications. First, we develop an adaptive sampling method that judiciously collects only essential samples. Second, we design a cross-application transfer learning model that exploits training samples collected from other applications. Extensive experiments in a Spark cluster on seven representative types of Spark applications show that ReLocag improves JCT prediction accuracy by 4-14 percent and reduces CPU core consumption by 58.2 percent.
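To make the architecture concrete, below is a minimal sketch (not the authors' implementation) of the pipeline the abstract describes: a GCN embeds the job's operation DAG, an FCNN maps that embedding plus job-level features to a predicted JCT, and sweeping the CPU-core feature over candidate values selects the count with the lowest predicted JCT. The layer sizes, the feature layout, and the pick_core_count helper are assumptions introduced purely for illustration.

# Minimal sketch of the GCN + FCNN JCT predictor described in the abstract.
# Not the authors' code; layer sizes and feature indices are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):
        return torch.relu(self.linear(a_hat @ h))

class JCTPredictor(nn.Module):
    def __init__(self, node_dim, hidden_dim=32, job_feat_dim=4):
        super().__init__()
        self.gcn1 = GCNLayer(node_dim, hidden_dim)
        self.gcn2 = GCNLayer(hidden_dim, hidden_dim)
        # FCNN head over the graph embedding plus job features
        # (input size, CPU cores, memory, number of tasks).
        self.fcnn = nn.Sequential(
            nn.Linear(hidden_dim + job_feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, adj, node_feats, job_feats):
        # Symmetric normalization: A_hat = D^-1/2 (A + I) D^-1/2.
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1)
        a_hat = a / torch.sqrt(d.unsqueeze(0) * d.unsqueeze(1))
        h = self.gcn2(a_hat, self.gcn1(a_hat, node_feats))
        graph_emb = h.mean(dim=0)          # pool node embeddings into one vector
        x = torch.cat([graph_emb, job_feats])
        return self.fcnn(x)                # predicted JCT (scalar tensor)

def pick_core_count(model, adj, node_feats, base_feats, candidates):
    """Sweep the CPU-core feature; return (cores, predicted JCT) with lowest JCT."""
    best = None
    with torch.no_grad():
        for cores in candidates:
            feats = base_feats.clone()
            feats[1] = float(cores)        # index 1 assumed to hold the core count
            jct = model(adj, node_feats, feats).item()
            if best is None or jct < best[1]:
                best = (cores, jct)
    return best

# Hypothetical usage: a 4-operation DAG with 3-dim node features and a job
# feature vector [input_size_gb, cpu_cores, memory_gb, num_tasks].
if __name__ == "__main__":
    adj = torch.tensor([[0, 1, 0, 0],
                        [0, 0, 1, 1],
                        [0, 0, 0, 0],
                        [0, 0, 0, 0]], dtype=torch.float)
    nodes = torch.randn(4, 3)
    model = JCTPredictor(node_dim=3)
    base = torch.tensor([10.0, 8.0, 16.0, 200.0])
    print(pick_core_count(model, adj, nodes, base, candidates=range(4, 65, 4)))

In practice the predictor would first be trained on (workflow, features, measured JCT) samples with a regression loss such as MSE; only then would the core-count sweep be meaningful at decision time.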
Pages: 2188-2201
Number of pages: 14