Enhancing Distributed Neural Network Training Through Node-Based Communications

Cited by: 1
Authors
Moreno-Alvarez, Sergio [1]
Paoletti, Mercedes E. [2]
Cavallaro, Gabriele [3]
Haut, Juan M. [2]
Affiliations
[1] Univ Extremadura, Escuela Politecn, Dept Ingn Sistemas Informat & Telemat, Caceres 10003, Spain
[2] Univ Extremadura, Dept Tecnol Comp & Comunicac, Escuela Politecn, Caceres 10003, Spain
[3] Forsch Zentrum Julich, Julich Supercomp Ctr, D-52428 Julich, Germany
Funding
European Union Horizon 2020
Keywords
Data parallelism; deep learning; high-performance computing (HPC); neural networks; synchronous communications; PARALLEL KERNELS; DEEP; OPTIMIZATION;
DOI
10.1109/TNNLS.2023.3309735
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The amount of data needed to effectively train modern deep neural architectures has grown significantly, leading to increased computational requirements. These intensive computations are handled by combining the latest generation of computing resources, such as accelerators, with classic processing units. Nevertheless, gradient communication remains the major bottleneck and limits efficiency despite the runtime improvements obtained through data parallelism strategies. Data parallelism involves all processes in a global exchange of a potentially large amount of data, which may impede achieving the desired speedup and eliminating noticeable delays or bottlenecks. As a result, communication latency poses a significant challenge that strongly affects performance on distributed platforms. This research presents node-based optimization steps that significantly reduce the gradient exchange between model replicas while ensuring model convergence. The proposal serves as a versatile communication scheme, suitable for integration into a wide range of general-purpose deep neural network (DNN) algorithms. The optimization takes into account the specific location of each replica within the platform. To demonstrate its effectiveness, different neural network approaches and datasets with disjoint properties are used. In addition, multiple types of applications are considered to demonstrate the robustness and versatility of the proposal. The experimental results show a reduction in global training time together with a slight improvement in accuracy.
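The abstract does not spell out the node-based steps, so the following is only an illustrative sketch of the general idea of a location-aware gradient exchange, not the authors' algorithm. It assumes a simple two-level scheme: replicas on the same node first combine their gradients locally, and only one vector per node then crosses node boundaries. All function names and the toy gradients are hypothetical.

```python
from typing import List

Vector = List[float]


def flat_mean(grads_by_node: List[List[Vector]]) -> Vector:
    """Baseline: average the gradients of all replicas in one global exchange."""
    all_grads = [g for node in grads_by_node for g in node]
    return [sum(vals) / len(all_grads) for vals in zip(*all_grads)]


def node_based_mean(grads_by_node: List[List[Vector]]) -> Vector:
    """Assumed two-level scheme: sum inside each node, then combine across nodes."""
    total_replicas = sum(len(node) for node in grads_by_node)
    # Step 1 (intra-node): each node reduces its replicas' gradients to one vector.
    node_sums = [[sum(vals) for vals in zip(*node)] for node in grads_by_node]
    # Step 2 (inter-node): only one vector per node is exchanged globally.
    return [sum(vals) / total_replicas for vals in zip(*node_sums)]


def inter_node_vectors(grads_by_node: List[List[Vector]], node_based: bool) -> int:
    """Count the gradient vectors that take part in the global exchange."""
    if node_based:
        return len(grads_by_node)
    return sum(len(node) for node in grads_by_node)


# Toy platform: 2 nodes, 2 replicas per node, 2-parameter model.
toy_grads = [
    [[1.0, 2.0], [3.0, 4.0]],  # node 0
    [[5.0, 6.0], [7.0, 8.0]],  # node 1
]
print(flat_mean(toy_grads))
print(node_based_mean(toy_grads))
print(inter_node_vectors(toy_grads, node_based=False),
      inter_node_vectors(toy_grads, node_based=True))
```

In this toy setting both functions return the same averaged gradient, while the two-level variant involves fewer vectors in the global step. Any reduction in real training time would depend on the interconnect and on details of the scheme that the abstract does not give.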
Pages: 17893-17907
Page count: 15