Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0
Authors
Liu, Rui [1]
Wong, David [2]
Lange, Dave [2]
Larsson, Patrik [2]
Jethava, Vinay [2]
Zheng, Qing [2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] DocuSign Inc, San Francisco, CA USA
Keywords
hyperparameter optimization; container; machine learning; deep learning; GPU
DOI
10.1145/3533028.3533309
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
DocuSign is rapidly expanding its use of artificial intelligence and is developing and deploying an increasing number of deep learning models. During development, developers usually build a number of deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because of ever-larger models and numerous hyperparameter configurations. Furthermore, HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload of long-running, containerized HPO jobs can rapidly saturate DocuSign's current machine learning infrastructure, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations need only a fraction of the GPU memory but occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can execute in parallel. We evaluate Relish with an HPO workload based on a three-month trace from a multi-tenant GPU cluster used by a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload by executing multiple jobs efficiently.
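The underutilization described in the abstract can be illustrated with a small sketch. It is not Relish's algorithm, which the abstract does not specify; it is a generic first-fit-decreasing bin-packing heuristic, shown only to make concrete how trials that each need a fraction of GPU memory could share devices instead of occupying one GPU per container. The trial names, memory estimates, and the 16 GiB capacity are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One GPU with a fixed memory capacity (GiB) and the trials placed on it."""
    capacity_gib: float
    trials: list = field(default_factory=list)
    used_gib: float = 0.0

    def fits(self, need_gib: float) -> bool:
        return self.used_gib + need_gib <= self.capacity_gib


def pack_trials(trial_mem_gib: dict, capacity_gib: float) -> list:
    """First-fit-decreasing: place each trial on the first GPU with room."""
    gpus = []
    # Largest trials first, so smaller ones fill the remaining gaps.
    for name, need in sorted(trial_mem_gib.items(), key=lambda kv: -kv[1]):
        if need > capacity_gib:
            raise ValueError(f"{name} needs {need} GiB, more than one GPU holds")
        target = next((g for g in gpus if g.fits(need)), None)
        if target is None:
            # No existing GPU has room, so allocate another device.
            target = Gpu(capacity_gib)
            gpus.append(target)
        target.trials.append(name)
        target.used_gib += need
    return gpus


# Hypothetical memory estimates (GiB) for five hyperparameter configurations.
trials = {
    "lr1e-3_bs32": 4.0,
    "lr1e-3_bs64": 7.0,
    "lr1e-4_bs32": 4.0,
    "lr1e-4_bs128": 12.0,
    "lr1e-2_bs16": 3.0,
}
placement = pack_trials(trials, capacity_gib=16.0)
for i, gpu in enumerate(placement):
    print(f"GPU {i}: {gpu.trials} ({gpu.used_gib} of {gpu.capacity_gib} GiB)")
```

With one container per device, these five trials would hold five GPUs; under the assumed memory estimates the packing needs two. A real system would also have to account for compute contention and for memory estimates that are wrong, which this sketch ignores.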
Pages: 10