HAL: Computer System for Scalable Deep Learning

Cited by: 45
Authors
Kindratenko, Volodymyr [1]
Mu, Dawei [1]
Zhan, Yan [1]
Maloney, John [1]
Hashemi, Sayed Hadi [1]
Rabe, Benjamin [1]
Xu, Ke [1]
Campbell, Roy [1]
Peng, Jian [1]
Gropp, William [1]
Affiliation
[1] UIUC, Natl Ctr Supercomp Applicat, Urbana, IL 61801 USA
Source
PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 2020, PEARC 2020 | 2020
Funding
National Science Foundation (USA)
Keywords
deep learning; cluster architecture; high-performance computing
DOI
10.1145/3311790.3396649
Chinese Library Classification
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We describe the design, deployment, and operation of a computer system built to run deep learning frameworks efficiently. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We built a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.
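To give a sense of what "all available GPUs" means for the hardware in the abstract, the sketch below works out the total worker count and maps a global worker rank to a node and a local GPU. The node and GPU counts come from the abstract; the function names and the block placement of ranks (consecutive ranks filling one node before the next) are illustrative assumptions, not details taken from the paper.

```python
# Hardware figures from the abstract: 16 servers, 4 GPUs each.
NODES = 16
GPUS_PER_NODE = 4


def world_size(nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Total number of GPU workers when every GPU runs one process."""
    return nodes * gpus_per_node


def rank_to_device(rank, nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Map a global rank to (node index, local GPU index).

    Assumes block placement: ranks 0..3 on node 0, 4..7 on node 1, and so on.
    This layout is hypothetical; a real launcher may place ranks differently.
    """
    if not 0 <= rank < nodes * gpus_per_node:
        raise ValueError(f"rank {rank} is outside 0..{nodes * gpus_per_node - 1}")
    return divmod(rank, gpus_per_node)


if __name__ == "__main__":
    print("workers:", world_size())
    print("rank 5 ->", rank_to_device(5))
```

Under these assumptions a full-system job has one worker per GPU, and the last rank lands on the fourth GPU of the sixteenth server.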
Pages: 41-48
Page count: 8