DC2: Delay-aware Compression Control for Distributed Machine Learning

Cited by: 26
Authors
Abdelmoniem, Ahmed M. [1 ]
Canini, Marco [1 ]
Affiliations
[1] KAUST, Thuwal, Saudi Arabia
Source
IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2021) | 2021
Keywords
Machine Learning; Distributed Training; Delay-aware Control; Adaptive Gradient Compression;
DOI
10.1109/INFOCOM42981.2021.9488810
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Distributed training performs data-parallel training of DNN models, a necessity for increasingly complex models and large datasets. Recent works identify communication as a major bottleneck in distributed training and seek opportunities to speed up training in systems supporting distributed ML workloads. To reduce communication volume, compression techniques have been proposed to accelerate the communication phase. However, compression comes at the cost of reduced model accuracy, especially when it is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control with network delays to apply compression adaptively. DC2 not only compensates for network variations but also strikes a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module for the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation on popular CNN models and datasets shows that DC2 achieves training speed-ups of up to 41x and 5.3x over no-compression and uniform-compression baselines, respectively.
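The abstract does not specify DC2's actual control law, only that it adapts gradient compression to observed network delay. A minimal sketch of that general idea, assuming a hypothetical AIMD-style controller (`DelayAwareController`) driving a top-k sparsifier — not the paper's algorithm:

```python
def topk_compress(grad, ratio):
    """Keep the `ratio` fraction of largest-magnitude entries; zero the rest.
    Illustrative sparsifier, not DC2's actual compression scheme."""
    k = max(1, int(len(grad) * ratio))
    threshold = sorted((abs(g) for g in grad), reverse=True)[k - 1]
    return [g if abs(g) >= threshold else 0.0 for g in grad]

class DelayAwareController:
    """Hypothetical AIMD-style rule: compress harder (multiplicative decrease
    of the kept fraction) when delay exceeds a target, relax (additive
    increase) when delay is below it."""
    def __init__(self, target_delay, ratio=1.0, min_ratio=0.01):
        self.target_delay = target_delay
        self.ratio = ratio          # fraction of gradient entries kept
        self.min_ratio = min_ratio  # floor to bound accuracy loss

    def update(self, observed_delay):
        if observed_delay > self.target_delay:
            self.ratio = max(self.min_ratio, self.ratio * 0.5)
        else:
            self.ratio = min(1.0, self.ratio + 0.05)
        return self.ratio

# Per iteration: measure the last all-reduce delay, update the ratio,
# then compress the next gradient before communicating it.
ctrl = DelayAwareController(target_delay=0.1)
ratio = ctrl.update(observed_delay=0.25)   # congestion: keep fewer entries
sparse = topk_compress([0.1, -0.5, 0.3, 0.0], ratio)
```

The multiplicative-decrease / additive-increase shape mirrors classic congestion avoidance; DC2's real controller and compression operators are described in the paper itself.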
Pages: 10