Sketch-fusion: A gradient compression method with multi-layer fusion for communication-efficient distributed training

Cited by: 4
Authors
Dai, Lingfei [1 ,2 ]
Gong, Luqi [1 ]
An, Zhulin [1 ]
Xu, Yongjun [1 ]
Diao, Boyu [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Coll Comp Sci, Beijing, Peoples R China
Keywords
Gradient compression; Multi-layer fusion; Distributed stochastic gradient descent; Deep learning training;
DOI
10.1016/j.jpdc.2023.104811
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Gradient compression is an effective technique for improving the efficiency of distributed training. However, introducing gradient compression can reduce model accuracy and training efficiency. Furthermore, we find that layer-wise gradient compression incurs significant compression and communication overhead, which can hurt the scaling efficiency of a distributed training system. To address these issues, we propose Sketch-Fusion SGD, a new method that leverages the Count-Sketch data structure to enhance the scalability and training speed of distributed deep learning systems. In addition, our method employs LayerFusion to improve the scalability and convergence efficiency of gradient compression algorithms by formulating an optimal multi-layer fusion strategy without introducing extra hyperparameters. We evaluate our method on a cluster of 16 GPUs and show that it improves training efficiency by up to 18.6% without compromising model accuracy. Applying our LayerFusion algorithm to other gradient compression methods also improves their scalability by up to 2.87x.
Pages: 10
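To make the Count-Sketch idea in the abstract concrete, the minimal Python sketch below shows how a fused gradient buffer can be folded into a small hash table and approximately recovered. This is an illustration under assumptions, not the paper's implementation: the class name `CountSketch`, the table dimensions, and the median-of-signed-buckets recovery step follow the standard Count-Sketch data structure rather than Sketch-Fusion SGD's actual code.

```python
import numpy as np

class CountSketch:
    """Minimal Count-Sketch for gradient compression (illustrative only).

    A length-n gradient is folded into a small (rows x cols) table using
    per-row hash buckets and random signs; coordinates are recovered as
    the median of their signed bucket values across rows.
    """

    def __init__(self, rows, cols, n, seed=0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        # One bucket index and one random +/-1 sign per (row, coordinate).
        self.bucket = rng.integers(0, cols, size=(rows, n))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, n))
        self.table = np.zeros((rows, cols))

    def compress(self, grad):
        """Fold a flat gradient vector into the sketch table."""
        self.table[:] = 0.0
        for r in range(self.rows):
            # Scatter-add each signed coordinate into its hashed bucket.
            np.add.at(self.table[r], self.bucket[r], self.sign[r] * grad)
        return self.table  # only this small table needs to be communicated

    def estimate(self):
        """Approximately recover the gradient from the sketch table."""
        per_row = np.stack([self.sign[r] * self.table[r, self.bucket[r]]
                            for r in range(self.rows)])
        # Taking the median across rows de-biases hash collisions.
        return np.median(per_row, axis=0)


# Hypothetical usage: fuse two layers' gradients into one buffer,
# compress once, then recover an approximation on the receiving side.
if __name__ == "__main__":
    g1, g2 = np.random.randn(4000), np.random.randn(2000)
    fused = np.concatenate([g1, g2])          # multi-layer fusion of gradients
    sketch = CountSketch(rows=5, cols=256, n=fused.size)
    table = sketch.compress(fused)            # 5 x 256 values instead of 6000
    approx = sketch.estimate()                # approximate fused gradient
    print(table.size, fused.size, np.corrcoef(approx, fused)[0, 1])
```

In a layer-fused setting such as the one the abstract describes, gradients from several layers would be flattened into one buffer before sketching, so the compression and communication overhead is paid once per fused group rather than once per layer.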