Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

Cited by: 5
Authors
Hu, Zhiyao [1 ]
Li, Dongsheng [2 ]
Zhang, Dongxiang [4 ]
Zhang, Yiming [3 ]
Peng, Baoyun [1 ]
Affiliations
[1] Natl Univ Def Technol, Changsha 410073, Peoples R China
[2] Natl Univ Def Technol, Coll Comp, Comp Sci, Changsha 410073, Peoples R China
[3] Natl Univ Def Technol, Sch Comp, Changsha 410073, Peoples R China
[4] Zhejiang Univ, Hangzhou 310027, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
Sparks; Resource management; Predictive models; Training; Task analysis; Transfer learning; Adaptation models; Data-parallel job; resource allocation; performance prediction; sampling overhead;
DOI
10.1109/TPDS.2021.3055019
CLC Number
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Under-allocating or over-allocating computation resources (e.g., CPU cores) can prolong the completion time of data-parallel jobs in a distributed system. We present a predictor, ReLocag, that finds the near-optimal number of CPU cores to minimize job completion time (JCT). ReLocag combines a graph convolutional network (GCN) with a fully-connected neural network (FCNN). The GCN learns the dependencies between operations from the workflow of a job, and the FCNN takes the resulting workflow representation together with other features (e.g., the input size, the number of CPU cores, the amount of memory, and the number of computation tasks) as input for JCT prediction. The prediction result guides the user in selecting the near-optimal number of CPU cores. In addition, we propose two strategies to mitigate the high cost of training-sample collection in big data applications. First, we develop an adaptive sampling method that judiciously collects only essential samples. Second, we design a cross-application transfer learning model that exploits training samples collected from other applications. We conduct extensive experiments in a Spark cluster on seven representative types of Spark applications. Results show that ReLocag improves JCT prediction accuracy by 4-14 percent and reduces CPU core consumption by 58.2 percent.
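To make the described architecture concrete, the sketch below (not the authors' code) shows a GCN encoding a job's operation DAG and an FCNN combining the pooled graph embedding with the scalar features named in the abstract to predict JCT. The class names, layer sizes, mean-pooling step, and degree normalization are illustrative assumptions.

```python
# Minimal sketch of a ReLocag-style predictor, assuming a PyTorch setting.
# A GCN encodes the workflow DAG; an FCNN maps [graph embedding, scalar
# features] -> predicted JCT. All hyperparameters are placeholders.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: mean-aggregate neighbor features over the
    self-loop-augmented adjacency matrix, then apply a learned projection."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, x):
        # adj: (n, n) adjacency of the job's operation DAG; x: (n, in_dim)
        a_hat = adj + torch.eye(adj.size(0))   # add self-loops
        deg = a_hat.sum(dim=1, keepdim=True)   # per-node degree (>= 1)
        x = (a_hat / deg) @ x                  # normalized neighbor average
        return torch.relu(self.linear(x))

class JCTPredictor(nn.Module):
    def __init__(self, node_dim, scalar_dim, hidden=64):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.fcnn = nn.Sequential(
            nn.Linear(hidden + scalar_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),              # predicted JCT
        )

    def forward(self, adj, node_feats, scalar_feats):
        h = self.gcn2(adj, self.gcn1(adj, node_feats))
        graph_emb = h.mean(dim=0)              # pool node embeddings
        return self.fcnn(torch.cat([graph_emb, scalar_feats]))

# Usage: sweep the CPU-core feature and keep the core count with the lowest
# predicted JCT, mirroring how the abstract says predictions guide allocation.
model = JCTPredictor(node_dim=8, scalar_dim=4)
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 2] = adj[2, 3] = adj[3, 4] = 1.0   # toy 5-stage DAG
nodes = torch.randn(5, 8)
# scalar features: [input size (GB), CPU cores, memory (GB), task count]
preds = {c: model(adj, nodes, torch.tensor([1.0, float(c), 4.0, 100.0])).item()
         for c in (8, 16, 32, 64)}
best_cores = min(preds, key=preds.get)
```

Separating the graph encoder from the scalar-feature head keeps the workflow representation reusable across candidate core counts; only the cheap FCNN pass must be repeated during the sweep.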
Pages: 2188-2201
Number of pages: 14