Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0

Authors:

Liu, Rui [1]

Wong, David [2]

Lange, Dave [2]

Larsson, Patrik [2]

Jethava, Vinay [2]

Zheng, Qing [2]

Affiliations:

[1] Univ Chicago, Chicago, IL 60637 USA

[2] DocuSign Inc, San Francisco, CA USA

Source:

PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022 | 2022

Keywords:

hyperparameter optimization; container; machine learning; deep learning; GPU;

DOI:

10.1145/3533028.3533309

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

DocuSign is rapidly advancing its use of artificial intelligence and is developing and deploying an increasing number of deep learning models. During development, developers usually build a number of deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because of ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload of long-running, containerized HPO jobs can rapidly saturate the current machine learning infrastructure at DocuSign, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations may need only a fraction of the GPU memory but occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can be executed in parallel. We evaluate Relish on an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.

Pages: 10

50 items in total

[31] Accelerating Containerized Machine Learning Workloads
Tariq, Ali
Cao, Lianjie
Ahmed, Faraz
Rozner, Eric
Sharma, Puneet
PROCEEDINGS OF 2024 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, NOMS 2024, 2024,
[32] Horizontal and Vertical Scaling of Container-based Applications using Reinforcement Learning
Rossi, Fabiana
Nardelli, Matteo
Cardellini, Valeria
2019 IEEE 12TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (IEEE CLOUD 2019), 2019, : 329 - 338
[33] Analyzing the Students' Learning within a Container-based Virtual Laboratory for Cybersecurity
Robles-Gomez, Antonio
Tobarra, Llanos
Pastor, Rafael
Hernandez, Roberto
Duque, Andres
Cano, Jesus
TEEM'19: SEVENTH INTERNATIONAL CONFERENCE ON TECHNOLOGICAL ECOSYSTEMS FOR ENHANCING MULTICULTURALITY, 2019, : 275 - 283
[34] A Parkinson's Auxiliary Diagnosis Algorithm Based on a Hyperparameter Optimization Method of Deep Learning
Wang, Xingbo
Li, Shujuan
Pun, Chi-Man
Guo, Yijing
Xu, Feng
Gao, Hao
Lu, Huimin
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2024, 21 (04) : 912 - 923
[35] Dynamical Hyperparameter Optimization via Deep Reinforcement Learning in Tracking
Dong, Xingping
Shen, Jianbing
Wang, Wenguan
Shao, Ling
Ling, Haibin
Porikli, Fatih
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (05) : 1515 - 1529
[36] A Deep Learning Based Fault Diagnosis Method With Hyperparameter Optimization by Using Parallel Computing
Guo, Chaozhong
Li, Lin
Hu, Yuanyuan
Yan, Jihong
IEEE ACCESS, 2020, 8 : 131248 - 131256
[37] Deep Learning Hyperparameter Optimization for Breast Mass Detection in Mammograms
Sehgal, Adarsh
Sehgal, Muskan
La, Hung Manh
Bebis, George
ADVANCES IN VISUAL COMPUTING, ISVC 2022, PT II, 2022, 13599 : 270 - 283
[38] Heuristic hyperparameter optimization of deep learning models for genomic prediction
Han, Junjie
Gondro, Cedric
Reid, Kenneth
Steibel, Juan P.
G3-GENES GENOMES GENETICS, 2021, 11 (07):
[39] Hyperparameter Optimization for Deep Reinforcement Learning in Vehicle Energy Management
Liessner, Roman
Schmitt, Jakob
Dietermann, Ansgar
Baeker, Bernard
PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE (ICAART), VOL 2, 2019, : 134 - 144
[40] ScaleReactor: A graceful performance isolation agent with interference detection and investigation for container-based scale-out workloads
Zhu, Jianyong
Hu, Chunming
Wo, Tianyu
Yu, Xiaoqiang
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (04):
