A Novel Classification Model to Predict Batch Job Failures in Co-located Cloud

Cited by: 0
Authors
Li, Yurui [1]
Lin, Weiwei [1]
Li, Keqin [2]
Wang, James Z. [3]
Liu, Fagui [1]
Liu, Jie [1]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] State Univ New York New Paltz, Dept Comp Sci, New Paltz, NY 12561 USA
[3] Scutech Corp Co Ltd, Guangzhou, Peoples R China
Source
2020 IEEE 26TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2020
Funding
National Natural Science Foundation of China;
Keywords
cloud computing; co-located datacenter; failure prediction; resource efficiency; datacenter;
DOI
10.1109/ICPADS51040.2020.00080
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cloud co-location is now widely used in data centers to improve the utilization of computing resources. However, batch jobs in a Co-location Datacenter (CLD) are vulnerable to failures because they compete with online service jobs for limited resources. Failed batch jobs are rescheduled and may fail repeatedly, which wastes computing resources and destabilizes the computing clusters. Therefore, we propose a method to accurately predict the potential failures of batch jobs in a CLD. The core of the proposed method is STLF (SMOTE Tomek and LightGBM [5] Framework), which is divided into three parts. First, we use the co-feature extraction method to generate the Co-located Feature Dataset (CLFD). Then SMOTE Tomek is used to oversample the CLFD so that the classifier can learn more minority-class features. Finally, we use a LightGBM classifier to predict batch job failures. Performance experiments conducted on the Ali Trace 2018 dataset show that the proposed STLF significantly outperforms existing popular classifiers in terms of the ROC curve, the area under the ROC curve (AUC), precision, and recall.
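The resampling step named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`smote`, `remove_tomek_links`, `smote_tomek`), the choice of `k`, oversampling to class parity, and dropping both points of each Tomek link are all assumptions made for illustration. Feature extraction and the LightGBM classifier are left out.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours.
    Assumes at least two minority points (hypothetical helper)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: dist2(base, minority[j]))
        other = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def remove_tomek_links(X, y):
    """Drop both points of every Tomek link: a pair of mutual nearest
    neighbours that carry different labels (single pass)."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist2(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    drop = {i for i, j in enumerate(nn) if nn[j] == i and y[i] != y[j]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then remove Tomek links."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, random.Random(seed))
    X_aug = list(X) + new_points
    y_aug = list(y) + [minority_label] * len(new_points)
    return remove_tomek_links(X_aug, y_aug)

# Toy data: label 1 stands in for the rare "failed batch job" class.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.4, 0.2), (0.3, 0.4), (0.5, 0.5),
     (10.0, 10.0), (10.2, 10.1), (10.1, 10.3)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = smote_tomek(X, y)
```

In practice one would more likely rely on existing libraries, for example `SMOTETomek` from imbalanced-learn followed by `LGBMClassifier` from lightgbm; the record does not state which implementations the authors used.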
Pages: 577-584
Page count: 8
Related papers
21 in total
[1]   Towards Understanding the Usage Behavior of Google Cloud Users: The Mice and Elephants Phenomenon [J].
Abdul-Rahman, Omar Arif ;
Aida, Kento .
2014 IEEE 6TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING TECHNOLOGY AND SCIENCE (CLOUDCOM), 2014, :272-277
[2] Alibaba Inc, 2017, AL PROD CLUST DAT 20
[3] [Anonymous], 2018, Characterizing co-located datacenter workloads: An alibaba case study
[4]   SMOTE: Synthetic minority over-sampling technique [J].
Chawla, Nitesh V. ;
Bowyer, Kevin W. ;
Hall, Lawrence O. ;
Kegelmeyer, W. Philip .
2002, American Association for Artificial Intelligence (16)
[5] Chen S., 2019, 24 INT C
[6] Chen WY, 2018, INT C PAR DISTRIB SY, P102, DOI [10.1109/PADSW.2018.8644579, 10.1109/ICPADS.2018.00024]
[7] Guo J, 2019, WHO LIMITS RESOURCE, P39
[8] Hemmat R. A., 2016, ARXIV DISTRIBUTED PA
[9] Ke GL, 2017, ADV NEUR IN, V30
[10]   Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining [J].
Liu, Chunhong ;
Dai, Liping ;
Lai, Yi ;
Lai, Guibing ;
Mao, Wentao .
COMPUTING, 2020, 102 (09) :2001-2023