Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

被引:30
作者
Li, Songze [1 ]
Avestimehr, Salman [1 ]
机构
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
来源
FOUNDATIONS AND TRENDS IN COMMUNICATIONS AND INFORMATION THEORY | 2020年 / 17卷 / 01期
基金
美国国家科学基金会;
关键词
MATRIX MULTIPLICATION; PARALLEL; COMPUTATION; ALGORITHM; FRAMEWORK; SUM;
D O I
10.1561/0100000103
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
We introduce the concept of "coded computing", a novel computing paradigm that utilizes coding theory to effectively inject and leverage data/computation redundancy to mitigate several fundamental bottlenecks in large-scale distributed computing, namely communication bandwidth, straggler's (i.e., slow or failing nodes) delay, privacy and security bottlenecks. More specifically, for MapReduce based distributed computing structures, we propose the "Coded Distributed Computing" (CDC) scheme, which injects redundant computations across the network in a structured manner, such that in-network coding opportunities are enabled to substantially slash the communication load to shuffle the intermediate computation results. We prove that CDC achieves the optimal tradeoff between computation and communication, and demonstrate its impact on a wide range of distributed computing systems from cloud-based datacenters to mobile edge/fog computing platforms. Secondly, to alleviate the straggler effect that prolongs the executions of distributed machine learning algorithms, we utilize the ideas from error correcting codes to develop "Polynomial Codes" for computing general matrix algebra, and "Lagrange Coded Computing" (LCC) for computing arbitrary multivariate polynomials. The core idea of these proposed schemes is to apply coding to create redundant data/computation scattered across the network, such that completing the overall computation task only requires a subset of the network nodes returning their local computation results. We demonstrate the optimality of Polynomial Codes and LCC in minimizing the computation latency, by proving that they require the least number of nodes to return their results. Finally, we illustrate the role of coded computing in providing security and privacy in distributed computing and machine learning. In particular, we consider the problems of secure multiparty computing (MPC) and privacy-preserving machine learning, and demonstrate how coded computing can be leveraged to provide efficient solutions to these critical problems and enable substantial improvements over the state of the art. To illustrate the impact of coded computing on real world applications and systems, we implement the proposed coding schemes on cloud-based distributed computing systems, and significantly improve the run-time performance of important benchmarks including distributed sorting, distributed training of regression models, and privacy-preserving training for image classification. Throughout this monograph, we also highlight numerous open problems and exciting research directions for future work on coded computing.
引用
收藏
页码:1 / 148
页数:148
相关论文
共 169 条
[31]  
Blanchard P., 2017, C NEUR INF PROC SYST
[32]  
Blanchard P, 2017, PREPRINT
[33]  
Bogdanov D, 2008, LECT NOTES COMPUT SC, V5283, P192
[34]   Practical Secure Aggregation for Privacy-Preserving Machine Learning [J].
Bonawitz, Keith ;
Ivanov, Vladimir ;
Kreuter, Ben ;
Marcedone, Antonio ;
McMahan, H. Brendan ;
Patel, Sarvar ;
Ramage, Daniel ;
Segal, Aaron ;
Seth, Karn .
CCS'17: PROCEEDINGS OF THE 2017 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, 2017, :1175-1191
[35]  
Bonomi F., 2012, P 1 ED MCC WORKSH MO, P13, DOI [10.1145/2342509.2342513, DOI 10.1145/2342509.2342513]
[36]  
Charles Z, 2017, PREPRINT
[37]   Replicated Data Placement for Uncertain Scheduling [J].
Chaubey, Manmohan ;
Saule, Erik .
2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, 2015, :464-472
[38]  
Chen L., 2018, ARXIV180309877
[39]   Fog and IoT: An Overview of Research Opportunities [J].
Chiang, Mung ;
Zhang, Tao .
IEEE INTERNET OF THINGS JOURNAL, 2016, 3 (06) :854-864
[40]  
Chilimbi T., 2014, 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), P571