Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0

Authors:

Liu, Rui [1]

Wong, David [2]

Lange, Dave [2]

Larsson, Patrik [2]

Jethava, Vinay [2]

Zheng, Qing [2]

Affiliations:

[1] Univ Chicago, Chicago, IL 60637 USA

[2] DocuSign Inc, San Francisco, CA USA

Source:

PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022 | 2022

Keywords:

hyperparameter optimization; container; machine learning; deep learning; GPU;

DOI:

10.1145/3533028.3533309

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

DocuSign is rapidly advancing its use of artificial intelligence and continues to shift toward developing and deploying a growing number of deep learning models. During development, developers usually build several deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because of ever-larger models and numerous hyperparameter configurations. Furthermore, HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload of long-running, containerized HPO jobs can rapidly saturate DocuSign's current machine learning infrastructure, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations may need only a fraction of the GPU memory but still occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can execute in parallel. We evaluate Relish on an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.

Pages: 10
