CIRRUS: a Serverless Framework for End-to-end ML Workflows

Cited by: 135

Authors:

Carreira, Joao [1]

Fonseca, Pedro [2]

Tumanov, Alexey [3]

Zhang, Andrew [1]

Katz, Randy [1]

Affiliations:

[1] Univ Calif Berkeley, Berkeley, CA 94720 USA

[2] Purdue Univ, W Lafayette, IN 47907 USA

[3] Georgia Inst Technol, Atlanta, GA 30332 USA

Source:

PROCEEDINGS OF THE 2019 TENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '19) | 2019

Keywords:

Serverless; Distributed Computing; Machine Learning

DOI:

10.1145/3357223.3362711

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Machine learning (ML) workflows are extremely complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that are repeatedly executed by users but have heterogeneous computational requirements. This complexity makes it challenging for ML users to correctly provision and manage resources and, in practice, constitutes a significant burden that frequently causes over-provisioning and impairs user productivity. Serverless computing is a compelling model for addressing the resource management problem in general, but adopting it for existing ML frameworks poses numerous challenges because of significant restrictions on local resources. This work proposes CIRRUS, an ML framework that automates the end-to-end management of datacenter resources for ML workflows by efficiently taking advantage of serverless infrastructures. CIRRUS combines the simplicity of the serverless interface and the scalability of the serverless infrastructure (AWS Lambdas and S3) to minimize user effort. We show that a design specialized for both serverless computation and iterative ML training is needed for robust and efficient ML training on serverless infrastructure. Our evaluation shows that CIRRUS outperforms frameworks specialized along a single dimension: CIRRUS is 100x faster than a general-purpose serverless system [36] and 3.75x faster than specialized ML frameworks for traditional infrastructures [49].

Pages: 13-24

Page count: 12

References (47 total):

[1] Agarwal A, 2014, J MACH LEARN RES, V15, P1111

[2] Aguilera, Marcos K.; Keeton, Kimberly; Novakovic, Stanko; Singhal, Sharad. Designing Far Memory Data Structures: Think Outside the Box [J]. PROCEEDINGS OF THE WORKSHOP ON HOT TOPICS IN OPERATING SYSTEMS (HOTOS '19), 2019: 120-126

[3] Akkus IE, 2018, PROCEEDINGS OF THE 2018 USENIX ANNUAL TECHNICAL CONFERENCE, P923

[4] [Anonymous], TensorFlow: a system for large-scale machine learning

[5] [Anonymous], 2016, MULTIVERSO

[6] [Anonymous], 2017, HUAWEI DC 3 0

[7] [Anonymous], 2016, AZURE FUNCTIONS

[8] [Anonymous], 2013, DISAGGREGATED RACK

[9] [Anonymous], 2018, GOOGLE CLOUDML SAMPL

[10] [Anonymous], 2018, ALIBABA FUNCTIONS
