Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Cited by: 62

Authors:

Hu, Qinghao [1,2]

Sun, Peng [3]

Yan, Shengen [3]

Wen, Yonggang [1]

Zhang, Tianwei [1]

Affiliations:

[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore

[2] Nanyang Technol Univ, S Lab, Singapore, Singapore

[3] SenseTime, Hong Kong, Peoples R China

Source:

SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2021

Keywords:

GPU Datacenter; Cluster Statistical Analysis; Deep Learning; Cluster Management System; Workload Scheduling; Energy Conservation; Time-series Prediction; SYSTEMS

DOI:

10.1145/3458817.3476223

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate cluster system design. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.


Pages: 15
