Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Cited by: 62
Authors
Hu, Qinghao [1, 2]
Sun, Peng [3]
Yan, Shengen [3]
Wen, Yonggang [1]
Zhang, Tianwei [1]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] Nanyang Technol Univ, S Lab, Singapore, Singapore
[3] SenseTime, Hong Kong, Peoples R China
Source
SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2021
Keywords
GPU Datacenter; Cluster Statistical Analysis; Deep Learning; Cluster Management System; Workload Scheduling Conservation; Time-series Prediction; SYSTEMS;
DOI
10.1145/3458817.3476223
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
Pages: 15