Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0

Authors:

Liu, Rui [1]

Wong, David [2]

Lange, Dave [2]

Larsson, Patrik [2]

Jethava, Vinay [2]

Zheng, Qing [2]

Affiliations:

[1] Univ Chicago, Chicago, IL 60637 USA

[2] DocuSign Inc, San Francisco, CA USA

Source:

PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022 | 2022

Keywords:

hyperparameter optimization; container; machine learning; deep learning; GPU;

DOI:

10.1145/3533028.3533309

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

DocuSign is advancing rapidly in artificial intelligence and is steadily shifting toward developing and deploying a growing number of deep learning models. During development, developers usually build several deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because of ever-larger models and numerous hyperparameter configurations. Furthermore, HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload consists of long-running, containerized HPO jobs that can rapidly saturate DocuSign's current machine learning infrastructure, yet key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations may need only a fraction of the GPU memory but still occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can execute in parallel. To evaluate Relish, we run an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
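The under-utilization described in the abstract can be illustrated with a small sketch. It is not Relish's algorithm, which the abstract does not specify. It assumes each HPO trial has a known peak GPU memory estimate and uses a generic first-fit-decreasing heuristic to co-locate trials on shared devices. The trial names, memory figures, the 16000 MB capacity, and the `Gpu` and `pack_trials` names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One GPU with a fixed memory capacity (hypothetical model)."""
    capacity_mb: int
    used_mb: int = 0
    trials: list = field(default_factory=list)

    def fits(self, need_mb: int) -> bool:
        return self.used_mb + need_mb <= self.capacity_mb

    def place(self, name: str, need_mb: int) -> None:
        self.trials.append(name)
        self.used_mb += need_mb


def pack_trials(trial_mem_mb: dict, gpu_capacity_mb: int) -> list:
    """Assign trials to GPUs with first-fit-decreasing bin packing.

    This is a generic heuristic chosen for illustration, not the
    scheduling policy of the system described in the abstract.
    """
    gpus = []
    # Place the largest trials first so smaller ones fill the gaps.
    ordered = sorted(trial_mem_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, need_mb in ordered:
        if need_mb > gpu_capacity_mb:
            raise ValueError(f"trial {name!r} does not fit on a single GPU")
        target = next((g for g in gpus if g.fits(need_mb)), None)
        if target is None:
            target = Gpu(gpu_capacity_mb)
            gpus.append(target)
        target.place(name, need_mb)
    return gpus


# Hypothetical per-trial peak GPU memory estimates, in MB.
trials = {"lr=0.1": 3000, "lr=0.01": 3000, "bs=256": 9000,
          "bs=64": 5000, "bs=32": 4000}
packed = pack_trials(trials, gpu_capacity_mb=16000)

exclusive_gpus = len(trials)  # one whole device per containerized trial
print(f"exclusive: {exclusive_gpus} GPUs, shared: {len(packed)} GPUs")
for i, gpu in enumerate(packed):
    print(f"GPU {i}: {gpu.trials} ({gpu.used_mb}/{gpu.capacity_mb} MB)")
```

With exclusive allocation each of the five trials would hold a whole device. In this toy setting, packing by memory footprint needs fewer devices, which is the kind of gain that GPU sharing targets. A real system would also have to account for compute contention and memory isolation between containers, which this sketch ignores.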


Pages: 10

