A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks

Cited by: 8
Authors
Barrachina, Sergio [1 ]
Castello, Adrian [1 ]
Catalan, Mar [1 ]
Dolz, Manuel F. [1 ]
Mestre, Jose, I [1 ]
Affiliation
[1] Univ Jaume I Castellon, Dept Ingn & Ciencia Comp, Castellon de La Plana, Spain
Source
2021 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW) | 2021
Keywords
Deep neural networks; distributed parallel training; Python; graphics processing units (GPUs)
DOI
10.1109/IPDPSW52791.2021.00110
Chinese Library Classification (CLC)
TP3 [computing technology; computer technology];
Discipline classification code
0812;
Abstract
We present PyDTNN, a framework for training deep neural networks (DNNs) on clusters of computers that has been designed as a research-oriented tool with a low learning curve. Our parallel training framework offers a set of functionalities that covers several must-have features of advanced deep learning (DL) software: 1) it is developed in Python in order to offer an accessible entry point for newcomers; 2) it is extensible, allowing users to prototype new research ideas without having to deal with complex software stacks; and 3) it delivers high parallel performance, exploiting MPI via mpi4py/NCCL for communication, and NumPy, cuDNN, and cuBLAS for computation. This paper provides practical evidence that PyDTNN attains accuracy and parallel performance similar to those exhibited by Google's TensorFlow (TF), though we recognize that PyDTNN cannot compete with production-level frameworks such as TF or PyTorch in terms of maturity and functionality. Instead, PyDTNN is designed as an accessible and customizable tool for prototyping ideas related to the distributed training of DNN models on clusters.
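A common way to realize the kind of cluster-level training the abstract describes is synchronous data parallelism: every MPI rank holds a replica of the model, computes gradients on its own shard of the mini-batch, and the gradients are averaged across ranks before each weight update. The following is a minimal sketch of that communication pattern using mpi4py and NumPy only; the toy least-squares model and all variable names are our own illustrative assumptions, not PyDTNN's actual API.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Toy model: a single linear layer trained by least squares; each rank
# draws its own synthetic shard of the global mini-batch.
rng = np.random.default_rng(rank)
true_w = np.array([[1.0], [-2.0], [0.5], [3.0]])
x = rng.standard_normal((32, 4))          # local shard of the batch
y = x @ true_w
w = np.zeros((4, 1))                      # model replica, identical on every rank

for step in range(100):
    # Local forward/backward pass (plain mean-squared-error gradient).
    grad = 2.0 * x.T @ (x @ w - y) / len(x)

    # Synchronous data parallelism: average the gradients across all ranks.
    global_grad = np.empty_like(grad)
    comm.Allreduce(grad, global_grad, op=MPI.SUM)
    global_grad /= size

    w -= 0.01 * global_grad               # identical update on every replica

if rank == 0:
    print("rank 0 local loss:", float(np.mean((x @ w - y) ** 2)))

Launched with, e.g., mpirun -np 4 python sketch.py, every rank applies the same averaged gradient, so all model replicas stay bit-identical; this is the pattern that a data-parallel framework maps onto MPI (mpi4py) or NCCL collectives.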
Pages: 730-739
Page count: 10