LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning

Cited by: 19
Authors
Zhang, Jingjing [1]
Simeone, Osvaldo [1]
Affiliations
[1] Kings Coll London, Dept Informat, London WC2R 2LS, England
Funding
European Research Council
Keywords
Encoding; Redundancy; Standards; Servers; Computer architecture; Computational complexity; Robustness; Adaptive selection; coding; distributed learning; gradient descent (GD); grouping; order statistics
DOI
10.1109/TNNLS.2020.2979762
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Gradient-based distributed learning in parameter server (PS) computing architectures is subject to random delays due to straggling worker nodes and to possible communication bottlenecks between PS and workers. Solutions have been recently proposed to separately address these impairments based on the ideas of gradient coding (GC), worker grouping, and adaptive worker selection. This article provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the benefits of GC and grouping in terms of robustness to stragglers with the communication and computation load gains of adaptive selection, novel strategies, named lazily aggregated GC (LAGC) and grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
Pages: 962-974
Page count: 13