Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

Cited by: 30
Authors
Li, Songze [1 ]
Avestimehr, Salman [1 ]
Affiliation
[1] Univ Southern Calif, Los Angeles, CA 90007 USA
Source
FOUNDATIONS AND TRENDS IN COMMUNICATIONS AND INFORMATION THEORY | 2020, Vol. 17, No. 1
Funding
U.S. National Science Foundation;
Keywords
MATRIX MULTIPLICATION; PARALLEL; COMPUTATION; ALGORITHM; FRAMEWORK; SUM;
DOI
10.1561/0100000103
Chinese Library Classification (CLC)
TP39 [Computer Applications];
Subject Classification Codes
081203 ; 0835 ;
Abstract
We introduce the concept of "coded computing", a novel computing paradigm that uses coding theory to effectively inject and leverage data/computation redundancy to mitigate several fundamental bottlenecks in large-scale distributed computing, namely the communication bandwidth, straggler (i.e., slow or failing nodes) delay, privacy, and security bottlenecks.

First, for MapReduce-based distributed computing structures, we propose the "Coded Distributed Computing" (CDC) scheme, which injects redundant computations across the network in a structured manner, enabling in-network coding opportunities that substantially reduce the communication load of shuffling intermediate computation results. We prove that CDC achieves the optimal tradeoff between computation and communication, and demonstrate its impact on a wide range of distributed computing systems, from cloud-based datacenters to mobile edge/fog computing platforms.

Second, to alleviate the straggler effect that prolongs the execution of distributed machine learning algorithms, we use ideas from error-correcting codes to develop "Polynomial Codes" for computing general matrix algebra and "Lagrange Coded Computing" (LCC) for computing arbitrary multivariate polynomials. The core idea of these schemes is to apply coding to create redundant data/computation scattered across the network, so that completing the overall computation task only requires a subset of the network nodes to return their local computation results. We demonstrate the optimality of Polynomial Codes and LCC in minimizing the computation latency by proving that they require the least number of nodes to return their results.

Finally, we illustrate the role of coded computing in providing security and privacy in distributed computing and machine learning. In particular, we consider the problems of secure multiparty computing (MPC) and privacy-preserving machine learning, and demonstrate how coded computing can be leveraged to provide efficient solutions to these critical problems, enabling substantial improvements over the state of the art. To illustrate the impact of coded computing on real-world applications and systems, we implement the proposed coding schemes on cloud-based distributed computing systems and significantly improve the run-time performance of important benchmarks, including distributed sorting, distributed training of regression models, and privacy-preserving training for image classification. Throughout this monograph, we also highlight numerous open problems and exciting research directions for future work on coded computing.
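To make the straggler-mitigation idea behind Polynomial Codes concrete, the following is a minimal NumPy sketch of polynomial-coded distributed matrix multiplication. It is an illustration under simplifying assumptions, not the authors' implementation: it works over the reals for readability (the scheme in the monograph is defined over a finite field), and names such as encode, worker, and WORKERS are hypothetical. A and B are split into m and n column blocks, each simulated worker multiplies one encoded pair, and any m*n responses suffice to recover A.T @ B by polynomial interpolation.

```python
# Minimal sketch (illustrative only): polynomial-coded distributed matrix
# multiplication over the reals, in the spirit of Polynomial Codes. The scheme
# in the monograph works over a finite field; names such as `encode`, `worker`,
# and WORKERS are hypothetical, not the authors' API.
import numpy as np

m, n = 2, 2                    # A split into m column blocks, B into n column blocks
WORKERS = 6                    # total workers; any m*n = 4 responses suffice
rng = np.random.default_rng(0)

A = rng.standard_normal((8, 4))    # goal: recover A.T @ B from partial worker results
B = rng.standard_normal((8, 6))
A_blocks = np.hsplit(A, m)         # A_0, A_1, each 8 x 2
B_blocks = np.hsplit(B, n)         # B_0, B_1, each 8 x 3

xs = np.arange(1.0, WORKERS + 1)   # one distinct evaluation point per worker

def encode(x):
    # Encoded inputs for the worker at evaluation point x.
    A_tilde = sum(A_blocks[j] * x**j for j in range(m))
    B_tilde = sum(B_blocks[k] * x**(k * m) for k in range(n))
    return A_tilde, B_tilde

def worker(x):
    # One small product; equals sum_{j,k} (A_j.T @ B_k) * x**(j + k*m),
    # a matrix polynomial of degree m*n - 1 in x.
    A_tilde, B_tilde = encode(x)
    return A_tilde.T @ B_tilde

# Simulate stragglers: suppose only workers 0, 2, 3, 5 respond in time.
fast = [0, 2, 3, 5]
responders = xs[fast]
results = np.stack([worker(x) for x in responders])      # shape (m*n, 2, 3)

# Decode: interpolate the degree-(m*n - 1) polynomial entrywise from m*n points.
V = np.vander(responders, m * n, increasing=True)        # Vandermonde system
coeffs = np.linalg.solve(V, results.reshape(m * n, -1)).reshape(m * n, 2, 3)

# Coefficient j + k*m is the block A_j.T @ B_k; reassemble and verify.
C_hat = np.block([[coeffs[j + k * m] for k in range(n)] for j in range(m)])
assert np.allclose(C_hat, A.T @ B)
```

Because each worker returns an evaluation of a degree-(m*n - 1) matrix polynomial, the master can ignore the slowest WORKERS - m*n nodes entirely; this recovery threshold is what the abstract refers to when it states that Polynomial Codes and LCC require the least number of nodes to return their results.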
Pages: 1-148
Page count: 148