A Dynamic Layer-Wise Gradient Sparsity and Gradient Merging Optimization Method for Deep Neural Networks

Cited by: 0
Authors
Ju, Tao [1]
Kang, Heting [1]
Liu, Shuai [1]
Huo, Jiuyuan [1]
Affiliations
[1] School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou
Source
Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University | 2024, Vol. 58, No. 09
Keywords
deep neural network; distributed training; gradient compression; layer gradient merging; synchronous data parallelism;
DOI
10.7652/xjtuxb202409011
Abstract
A dynamic layer-wise gradient sparsification and gradient merging optimization method for deep neural networks is proposed to address the substantial communication overhead, long training time, and low resource utilization associated with accelerating large-scale deep neural networks through data parallelism. First, a dynamic layer-wise gradient sparsification method is proposed that combines gradient sparsification compression with pipeline parallelism: each network layer is assigned an appropriate threshold, which is adjusted dynamically in subsequent iterations to achieve adaptive compression of that layer's gradient transmission. Second, a layer-wise gradient merging method is introduced. Using dynamic programming, it jointly optimizes communication overhead, sparsity, and per-layer gradient computation time to determine the optimal combination for merging multiple layers of small gradient tensors into a single communication layer, thereby reducing the high communication latency incurred by layer-wise gradient decision-making. Finally, the resulting optimal layer-wise gradient merging combination is applied to the actual training iterations. Experimental results demonstrate that, compared with existing methods, the proposed method significantly reduces communication overhead and speeds up model training while maintaining training accuracy, achieving a training speedup of up to 1.99 times over the uncompressed method. © 2024 Xi'an Jiaotong University. All rights reserved.
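The abstract above outlines two coupled ideas: per-layer thresholds adapted across iterations to sparsify each layer's gradient, and a dynamic-programming search over which consecutive layers to merge into a single communication message. The sketch below is a minimal, hypothetical Python illustration of those two ideas, not the authors' implementation; the cost model (a fixed per-message startup latency, per-layer payload sizes after sparsification, and backward-pass compute times) and all function names are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (assumptions, not the paper's code): per-layer
# dynamic-threshold gradient sparsification plus a dynamic-programming search
# for how to merge consecutive layers' gradients into communication buckets.
import numpy as np


def sparsify_layer(grad, threshold):
    """Transmit only the gradient entries whose magnitude exceeds the layer's
    current threshold; return the sparsified gradient and the achieved density."""
    mask = np.abs(grad) >= threshold
    return grad * mask, float(mask.mean())


def update_threshold(threshold, achieved_density, target_density, step=0.1):
    """Adapt a layer's threshold between iterations so that the fraction of
    values actually sent drifts toward the target density (simple multiplicative rule)."""
    if achieved_density > target_density:
        return threshold * (1.0 + step)   # sending too much: raise the threshold
    return threshold * (1.0 - step)       # sending too little: lower the threshold


def optimal_merge(compute_times, payload_costs, startup_cost):
    """Dynamic programming over layer boundaries (layers listed in backward order).
    dp[i] = earliest time by which the gradients of the first i layers have been
    communicated, given that a merged bucket can only start transferring after its
    last layer's gradient is computed and the previous bucket has finished.
    Returns the chosen bucket boundaries and the estimated total time."""
    n = len(compute_times)
    grad_ready = np.cumsum(compute_times)            # backward-pass finish times
    dp = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                           # candidate bucket: layers j..i-1
            start = max(dp[j], grad_ready[i - 1])
            total = start + startup_cost + sum(payload_costs[j:i])
            if total < dp[i]:
                dp[i], prev[i] = total, j
    buckets, i = [], n
    while i > 0:                                     # recover bucket boundaries
        buckets.append((prev[i], i))
        i = prev[i]
    return buckets[::-1], dp[n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=s) for s in (1000, 500, 2000)]   # toy per-layer gradients
    thresholds = [0.5, 0.5, 0.5]
    sparsified, densities = zip(*(sparsify_layer(g, t) for g, t in zip(grads, thresholds)))
    thresholds = [update_threshold(t, d, 0.01) for t, d in zip(thresholds, densities)]
    payloads = [d * g.size * 1e-3 for d, g in zip(densities, grads)]  # toy comm costs
    print(optimal_merge([1.0, 0.8, 1.5], payloads, startup_cost=2.0))
```

In this toy cost model, larger buckets amortize the per-message startup latency but delay the start of communication until the last merged layer has finished its backward computation; that trade-off between latency amortization and compute-communication overlap is the balance the abstract attributes to its dynamic-programming merging step.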
Pages: 105-116
Number of pages: 11