Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Cited by: 81
Authors
Hu, Qinghao [1, 2]
Sun, Peng [3]
Yan, Shengen [3]
Wen, Yonggang [1]
Zhang, Tianwei [1]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] Nanyang Technol Univ, S Lab, Singapore, Singapore
[3] SenseTime, Hong Kong, Peoples R China
Source
SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2021
Keywords
GPU Datacenter; Cluster Statistical Analysis; Deep Learning; Cluster Management System; Workload Scheduling Conservation; Time-series Prediction; SYSTEMS;
DOI
10.1145/3458817.3476223
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate cluster system design. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
Pages: 15