Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Cited by: 62
Authors
Hu, Qinghao [1, 2]
Sun, Peng [3 ]
Yan, Shengen [3 ]
Wen, Yonggang [1 ]
Zhang, Tianwei [1 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[2] Nanyang Technol Univ, S Lab, Singapore, Singapore
[3] SenseTime, Hong Kong, Peoples R China
Source
SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2021
Keywords
GPU Datacenter; Cluster Statistical Analysis; Deep Learning; Cluster Management System; Workload Scheduling Conservation; Time-series Prediction; SYSTEMS;
DOI
10.1145/3458817.3476223
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design (1) a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; (2) a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.
Pages: 15