Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Authors
Liu, Rui [1]
Wong, David [2]
Lange, Dave [2]
Larsson, Patrik [2]
Jethava, Vinay [2]
Zheng, Qing [2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] DocuSign Inc, San Francisco, CA USA
Keywords
hyperparameter optimization; container; machine learning; deep learning; GPU
DOI
10.1145/3533028.3533309
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
DocuSign is advancing rapidly in artificial intelligence and is steadily developing and deploying an increasing number of deep learning models. During development, developers usually build several deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time because of ever-larger models and numerous hyperparameter configurations. Furthermore, HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The resulting workload of long-running, containerized HPO jobs can rapidly saturate DocuSign's current machine learning infrastructure, yet key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations need only a fraction of the GPU memory but occupy the entire device because of containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take unexpectedly long to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can execute in parallel. To evaluate Relish, we run an HPO workload based on a three-month trace from a multi-tenant GPU cluster used by a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
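Illustrative sketch (not from the paper): the abstract notes that a configuration needing only a fraction of GPU memory still occupies a whole device. The Python snippet below shows, under stated assumptions, how co-locating trials by estimated peak memory can reduce the number of devices required. It uses a generic first-fit-decreasing heuristic, which is not claimed to be Relish's scheduling algorithm. The names `Gpu` and `pack_trials` and all memory figures are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """One physical GPU with a fixed memory capacity in GiB (hypothetical model)."""
    capacity_gib: float
    trials: list = field(default_factory=list)  # (trial name, memory in GiB) pairs

    @property
    def free_gib(self) -> float:
        return self.capacity_gib - sum(mem for _, mem in self.trials)


def pack_trials(trial_mem_gib: dict, gpu_capacity_gib: float) -> list:
    """Place trials on GPUs with first-fit-decreasing; a heuristic, not an optimal packing."""
    gpus = []
    # Largest trials first, so small ones can fill the gaps left behind.
    for name, mem in sorted(trial_mem_gib.items(), key=lambda kv: kv[1], reverse=True):
        if mem > gpu_capacity_gib:
            raise ValueError(f"trial {name!r} needs more memory than one GPU offers")
        # Reuse the first GPU with enough free memory, else open a new one.
        target = next((g for g in gpus if g.free_gib >= mem), None)
        if target is None:
            target = Gpu(gpu_capacity_gib)
            gpus.append(target)
        target.trials.append((name, mem))
    return gpus


# Assumed peak-memory estimates (GiB) for four hyperparameter configurations.
trials = {
    "lr=1e-3,bs=32": 3.0,
    "lr=1e-3,bs=128": 9.0,
    "lr=1e-4,bs=32": 3.0,
    "lr=1e-4,bs=64": 5.0,
}
placement = pack_trials(trials, gpu_capacity_gib=16.0)
for index, gpu in enumerate(placement):
    print(index, [name for name, _ in gpu.trials], f"free={gpu.free_gib} GiB")
```

With these made-up numbers, the four trials fit on two 16 GiB devices, whereas exclusive one-container-per-GPU allocation would need four. A real system would also have to account for compute contention and for memory estimates that turn out to be wrong, which this sketch ignores.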
Pages: 10