Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning

Times Cited: 0
Authors
Zhang, Lin [1 ]
Shi, Shaohuai [2 ]
Wang, Wei [1 ]
Li, Bo [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[2] Harbin Inst Technol, Sch Comp Sci & Technol, Shenzhen 518055, Guangdong, Peoples R China
Keywords
Training; Computational modeling; Clustering algorithms; Graphics processing units; Memory management; Deep learning; Convergence; Distributed deep learning; K-FAC; performance optimization; second-order; NATURAL GRADIENT
DOI
10.1109/TCC.2022.3205918
Chinese Library Classification
TP [automation technology; computer technology]
Discipline Classification Code
0812
Abstract
Second-order optimization methods, notably D-KFAC (Distributed Kronecker-Factored Approximate Curvature) algorithms, have gained traction for accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms must compute and communicate a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this article, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF construction tasks of different DNN layers to different workers. DP-KFAC not only retains the convergence property of existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and a low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
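The abstract describes the scheme only at a high level; the following is a minimal, hypothetical Python/NumPy sketch of layer-wise distributed preconditioning in the spirit described (each layer's Kronecker factors are built and inverted by exactly one worker, so the KFs themselves never need to be communicated). The assignment policy assign_owner, the helper precondition_layer, the damping constant, and the single-process worker loop are illustrative assumptions, not the paper's implementation; a real run would still exchange the resulting preconditioned gradients, which is only indicated by a comment.

```python
import numpy as np

# Hypothetical sketch of distributed preconditioning (DP-KFAC-style idea):
# one worker owns each layer, constructs and inverts its Kronecker factors
# locally, and only the preconditioned gradient would need to be exchanged.

DAMPING = 0.03  # illustrative Tikhonov damping added to both factors


def assign_owner(layer_idx: int, world_size: int) -> int:
    """Round-robin layer-to-worker assignment (an assumed placement policy)."""
    return layer_idx % world_size


def precondition_layer(acts, grads_out, grad_w):
    """Standard K-FAC preconditioning for one fully connected layer.

    acts:      (batch, in_dim)   layer inputs
    grads_out: (batch, out_dim)  gradients w.r.t. layer outputs
    grad_w:    (out_dim, in_dim) gradient of the weight matrix
    """
    batch = acts.shape[0]
    A = acts.T @ acts / batch            # input-side Kronecker factor
    G = grads_out.T @ grads_out / batch  # output-side Kronecker factor
    A_inv = np.linalg.inv(A + DAMPING * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + DAMPING * np.eye(G.shape[0]))
    return G_inv @ grad_w @ A_inv        # (G^-1) dW (A^-1)


# Simulate 4 workers and 6 layers in one process, for illustration only;
# in a real cluster each process would execute just its own rank's branch.
world_size, num_layers, batch = 4, 6, 32
rng = np.random.default_rng(0)
layers = [dict(acts=rng.normal(size=(batch, 8)),
               grads_out=rng.normal(size=(batch, 8)),
               grad_w=rng.normal(size=(8, 8))) for _ in range(num_layers)]

for rank in range(world_size):
    for idx, layer in enumerate(layers):
        if assign_owner(idx, world_size) != rank:
            continue                     # this rank never builds KFs for other layers
        update = precondition_layer(layer["acts"], layer["grads_out"],
                                    layer["grad_w"])
        # A real multi-GPU run would now broadcast (or reduce) only this
        # preconditioned gradient to the other workers, never the KFs.
        print(f"rank {rank} preconditioned layer {idx}, "
              f"norm={np.linalg.norm(update):.3f}")
```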
Pages: 2365-2378
Number of Pages: 14
Related Papers
50 records in total
  • [31] High Performance Training of Deep Neural Networks Using Pipelined Hardware Acceleration and Distributed Memory
    Mehta, Ragav
    Huang, Yuyang
    Cheng, Mingxi
    Bagga, Shrey
    Mathur, Nishant
    Li, Ji
    Draper, Jeffrey
    Nazarian, Shahin
    2018 19TH INTERNATIONAL SYMPOSIUM ON QUALITY ELECTRONIC DESIGN (ISQED), 2018, : 383 - 388
  • [32] Distributed B-SDLM: Accelerating the Training Convergence of Deep Neural Networks Through Parallelism
    Liew, Shan Sung
    Khalil-Hani, Mohamed
    Bakhteri, Rabia
    PRICAI 2016: TRENDS IN ARTIFICIAL INTELLIGENCE, 2016, 9810 : 243 - 250
  • [33] An efficient bandwidth-adaptive gradient compression algorithm for distributed training of deep neural networks
    Wang, Zeqin
    Duan, Qingyang
    Xu, Yuedong
    Zhang, Liang
    JOURNAL OF SYSTEMS ARCHITECTURE, 2024, 150
  • [34] The Impact of Architecture on the Deep Neural Networks Training
    Rozycki, Pawel
    Kolbusz, Janusz
    Malinowski, Aleksander
    Wilamowski, Bogdan
    2019 12TH INTERNATIONAL CONFERENCE ON HUMAN SYSTEM INTERACTION (HSI), 2019, : 41 - 46
  • [35] An Optimization Strategy for Deep Neural Networks Training
    Wu, Tingting
    Zeng, Peng
    Song, Chunhe
    2022 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, COMPUTER VISION AND MACHINE LEARNING (ICICML), 2022, : 596 - 603
  • [36] DANTE: Deep alternations for training neural networks
    Sinha, Vaibhav B.
    Kudugunta, Sneha
    Sankar, Adepu Ravi
    Chavali, Surya Teja
    Balasubramanian, Vineeth N.
    NEURAL NETWORKS, 2020, 131 : 127 - 143
  • [37] Heterogeneous gradient computing optimization for scalable deep neural networks
    Moreno-Alvarez, Sergio
    Paoletti, Mercedes E.
    Rico-Gallego, Juan A.
    Haut, Juan M.
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (11) : 13455 - 13469
  • [39] Appropriate Learning Rates of Adaptive Learning Rate Optimization Algorithms for Training Deep Neural Networks
    Iiduka, Hideaki
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (12) : 13250 - 13261
  • [40] An Efficient Method for Training Deep Learning Networks Distributed
    Wang, Chenxu
    Lu, Yutong
    Chen, Zhiguang
    Li, Junnan
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2020, E103D (12) : 2444 - 2456