Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

Cited by: 30
Authors
Li, Songze [1 ]
Avestimehr, Salman [1 ]
Affiliation
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
Source
FOUNDATIONS AND TRENDS IN COMMUNICATIONS AND INFORMATION THEORY | 2020, Vol. 17, No. 1
Funding
U.S. National Science Foundation;
Keywords
MATRIX MULTIPLICATION; PARALLEL; COMPUTATION; ALGORITHM; FRAMEWORK; SUM;
DOI
10.1561/0100000103
Chinese Library Classification (CLC)
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
We introduce the concept of "coded computing", a novel computing paradigm that uses coding theory to effectively inject and leverage data/computation redundancy to mitigate several fundamental bottlenecks in large-scale distributed computing, namely the communication bandwidth, straggler (i.e., slow or failing nodes) delay, privacy, and security bottlenecks.

First, for MapReduce-based distributed computing structures, we propose the "Coded Distributed Computing" (CDC) scheme, which injects redundant computations across the network in a structured manner, enabling in-network coding opportunities that substantially reduce the communication load of shuffling intermediate computation results. We prove that CDC achieves the optimal tradeoff between computation and communication, and demonstrate its impact on a wide range of distributed computing systems, from cloud-based datacenters to mobile edge/fog computing platforms.

Second, to alleviate the straggler effect that prolongs the execution of distributed machine learning algorithms, we use ideas from error-correcting codes to develop "Polynomial Codes" for computing general matrix algebra and "Lagrange Coded Computing" (LCC) for computing arbitrary multivariate polynomials. The core idea of these schemes is to apply coding to create redundant data/computation scattered across the network, so that completing the overall computation task only requires a subset of the network nodes to return their local computation results. We demonstrate the optimality of Polynomial Codes and LCC in minimizing the computation latency by proving that they require the least number of nodes to return their results.

Finally, we illustrate the role of coded computing in providing security and privacy in distributed computing and machine learning. In particular, we consider the problems of secure multiparty computing (MPC) and privacy-preserving machine learning, and demonstrate how coded computing can be leveraged to provide efficient solutions to these critical problems, enabling substantial improvements over the state of the art. To illustrate the impact of coded computing on real-world applications and systems, we implement the proposed coding schemes on cloud-based distributed computing systems and significantly improve the run-time performance of important benchmarks, including distributed sorting, distributed training of regression models, and privacy-preserving training for image classification. Throughout this monograph, we also highlight numerous open problems and exciting research directions for future work on coded computing.
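To make the straggler-mitigation idea behind Polynomial Codes concrete, the following is a minimal NumPy sketch of polynomial-coded distributed matrix multiplication. It is an illustration under simplifying assumptions, not the authors' implementation: it works over the reals for readability (the scheme in the monograph is defined over a finite field), and names such as encode, worker, and WORKERS are hypothetical. A and B are split into m and n column blocks, each simulated worker multiplies one encoded pair, and any m*n responses suffice to recover A.T @ B by polynomial interpolation.

```python
# Minimal sketch (illustrative only): polynomial-coded distributed matrix
# multiplication over the reals, in the spirit of Polynomial Codes. The scheme
# in the monograph works over a finite field; names such as `encode`, `worker`,
# and WORKERS are hypothetical, not the authors' API.
import numpy as np

m, n = 2, 2                    # A split into m column blocks, B into n column blocks
WORKERS = 6                    # total workers; any m*n = 4 responses suffice
rng = np.random.default_rng(0)

A = rng.standard_normal((8, 4))    # goal: recover A.T @ B from partial worker results
B = rng.standard_normal((8, 6))
A_blocks = np.hsplit(A, m)         # A_0, A_1, each 8 x 2
B_blocks = np.hsplit(B, n)         # B_0, B_1, each 8 x 3

xs = np.arange(1.0, WORKERS + 1)   # one distinct evaluation point per worker

def encode(x):
    # Encoded inputs for the worker at evaluation point x.
    A_tilde = sum(A_blocks[j] * x**j for j in range(m))
    B_tilde = sum(B_blocks[k] * x**(k * m) for k in range(n))
    return A_tilde, B_tilde

def worker(x):
    # One small product; equals sum_{j,k} (A_j.T @ B_k) * x**(j + k*m),
    # a matrix polynomial of degree m*n - 1 in x.
    A_tilde, B_tilde = encode(x)
    return A_tilde.T @ B_tilde

# Simulate stragglers: suppose only workers 0, 2, 3, 5 respond in time.
fast = [0, 2, 3, 5]
responders = xs[fast]
results = np.stack([worker(x) for x in responders])      # shape (m*n, 2, 3)

# Decode: interpolate the degree-(m*n - 1) polynomial entrywise from m*n points.
V = np.vander(responders, m * n, increasing=True)        # Vandermonde system
coeffs = np.linalg.solve(V, results.reshape(m * n, -1)).reshape(m * n, 2, 3)

# Coefficient j + k*m is the block A_j.T @ B_k; reassemble and verify.
C_hat = np.block([[coeffs[j + k * m] for k in range(n)] for j in range(m)])
assert np.allclose(C_hat, A.T @ B)
```

Because each worker returns an evaluation of a degree-(m*n - 1) matrix polynomial, the master can ignore the slowest WORKERS - m*n nodes entirely; this recovery threshold is what the abstract refers to when it states that Polynomial Codes and LCC require the least number of nodes to return their results.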
Pages: 1-148
Page count: 148