Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0
Authors
Liu, Rui [1]
Wong, David [2]
Lange, Dave [2]
Larsson, Patrik [2]
Jethava, Vinay [2]
Zheng, Qing [2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] DocuSign Inc, San Francisco, CA USA
Keywords
hyperparameter optimization; container; machine learning; deep learning; GPU
DOI
10.1145/3533028.3533309
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
DocuSign is advancing rapidly in artificial intelligence and is steadily shifting towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The workload consists of long-running, containerized HPO jobs that can rapidly saturate the current machine learning infrastructure at DocuSign, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations may take only a fraction of the GPU memory but occupy the entire device due to containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can be executed in parallel. To evaluate Relish, we run an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
Pages: 10
Related papers
50 in total
  • [31] Accelerating Containerized Machine Learning Workloads
    Tariq, Ali
    Cao, Lianjie
    Ahmed, Faraz
    Rozner, Eric
    Sharma, Puneet
    PROCEEDINGS OF 2024 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, NOMS 2024, 2024
  • [32] Horizontal and Vertical Scaling of Container-based Applications using Reinforcement Learning
    Rossi, Fabiana
    Nardelli, Matteo
    Cardellini, Valeria
    2019 IEEE 12TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (IEEE CLOUD 2019), 2019, : 329 - 338
  • [33] Analyzing the Students' Learning within a Container-based Virtual Laboratory for Cybersecurity
    Robles-Gomez, Antonio
    Tobarra, Llanos
    Pastor, Rafael
    Hernandez, Roberto
    Duque, Andres
    Cano, Jesus
    TEEM'19: SEVENTH INTERNATIONAL CONFERENCE ON TECHNOLOGICAL ECOSYSTEMS FOR ENHANCING MULTICULTURALITY, 2019, : 275 - 283
  • [34] A Parkinson's Auxiliary Diagnosis Algorithm Based on a Hyperparameter Optimization Method of Deep Learning
    Wang, Xingbo
    Li, Shujuan
    Pun, Chi-Man
    Guo, Yijing
    Xu, Feng
    Gao, Hao
    Lu, Huimin
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2024, 21 (04) : 912 - 923
  • [35] Dynamical Hyperparameter Optimization via Deep Reinforcement Learning in Tracking
    Dong, Xingping
    Shen, Jianbing
    Wang, Wenguan
    Shao, Ling
    Ling, Haibin
    Porikli, Fatih
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (05) : 1515 - 1529
  • [36] A Deep Learning Based Fault Diagnosis Method With Hyperparameter Optimization by Using Parallel Computing
    Guo, Chaozhong
    Li, Lin
    Hu, Yuanyuan
    Yan, Jihong
    IEEE ACCESS, 2020, 8 : 131248 - 131256
  • [37] Deep Learning Hyperparameter Optimization for Breast Mass Detection in Mammograms
    Sehgal, Adarsh
    Sehgal, Muskan
    La, Hung Manh
    Bebis, George
    ADVANCES IN VISUAL COMPUTING, ISVC 2022, PT II, 2022, 13599 : 270 - 283
  • [38] Heuristic hyperparameter optimization of deep learning models for genomic prediction
    Han, Junjie
    Gondro, Cedric
    Reid, Kenneth
    Steibel, Juan P.
    G3-GENES GENOMES GENETICS, 2021, 11 (07):
  • [39] Hyperparameter Optimization for Deep Reinforcement Learning in Vehicle Energy Management
    Liessner, Roman
    Schmitt, Jakob
    Dietermann, Ansgar
    Baeker, Bernard
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE (ICAART), VOL 2, 2019, : 134 - 144
  • [40] ScaleReactor: A graceful performance isolation agent with interference detection and investigation for container-based scale-out workloads
    Zhu, Jianyong
    Hu, Chunming
    Wo, Tianyu
    Yu, Xiaoqiang
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (04):