RubberBand: Cloud-based Hyperparameter Tuning

Cited by: 14
Authors
Misra, Ujval [1]
Liaw, Richard [1]
Dunlap, Lisa [1]
Bhardwaj, Romil [1]
Kandasamy, Kirthevasan [1]
Gonzalez, Joseph E. [1]
Stoica, Ion [1]
Tumanov, Alexey [2]
Affiliations
[1] Univ Calif Berkeley, Berkeley, CA 94720 USA
[2] Georgia Inst Technol, Atlanta, GA 30332 USA
Source
PROCEEDINGS OF THE SIXTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS (EUROSYS '21) | 2021
Keywords
Hyperparameter Optimization; Distributed Machine Learning
DOI
10.1145/3447786.3456245
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Hyperparameter tuning is essential to achieving state-of-the-art accuracy in machine learning (ML), but it requires substantial compute resources. Existing systems primarily focus on effectively allocating resources for a hyperparameter tuning job under fixed resource constraints. We show that the available parallelism in such jobs changes dynamically over the course of execution and, therefore, presents an opportunity to leverage the elasticity of the cloud. In particular, we address the problem of minimizing the financial cost of executing a hyperparameter tuning job, subject to a time constraint. We present RubberBand, the first framework for cost-efficient, elastic execution of hyperparameter tuning jobs in the cloud. RubberBand utilizes performance instrumentation and cloud pricing to model job completion time and cost prior to runtime, and to generate a cost-efficient, elastic resource allocation plan. RubberBand is able to efficiently execute this plan and realize a cost reduction of up to 2x in comparison to static allocation baselines.
Pages: 327-342
Page count: 16