Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads

Cited by: 0
Authors
Liu, Rui [1]
Wong, David [2]
Lange, Dave [2]
Larsson, Patrik [2]
Jethava, Vinay [2]
Zheng, Qing [2]
Affiliations
[1] Univ Chicago, Chicago, IL 60637 USA
[2] DocuSign Inc, San Francisco, CA USA
Keywords
hyperparameter optimization; container; machine learning; deep learning; GPU
DOI
10.1145/3533028.3533309
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
DocuSign is advancing rapidly in artificial intelligence and is steadily shifting towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them with many candidate hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The workload, which consists of long-running, containerized HPO jobs, can rapidly saturate the current machine learning infrastructure at DocuSign, yet the key resources (e.g., GPU memory or compute units) are not always fully utilized. For example, some hyperparameter configurations may take only a fraction of the GPU memory but occupy the entire device due to containerization. As a result, users may have to either wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized, segmented jobs can be executed in parallel. To evaluate Relish, we run an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team at DocuSign. The results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
Pages: 10
Related papers
50 in total
  • [41] A Hybrid Sparrow Search Algorithm of the Hyperparameter Optimization in Deep Learning
    Fan, Yanyan
    Zhang, Yu
    Guo, Baosu
    Luo, Xiaoyuan
    Peng, Qingjin
    Jin, Zhenlin
    MATHEMATICS, 2022, 10 (16)
  • [42] Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning
    Dong, Xingping
    Shen, Jianbing
    Wang, Wenguan
    Liu, Yu
    Shao, Ling
    Porikli, Fatih
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 518 - 527
  • [43] Serverless computing for container-based architectures
    Perez, Alfonso
    Molto, German
    Caballer, Miguel
    Calatrava, Amanda
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2018, 83 : 50 - 59
  • [44] Deep Learning-Based Maximum Temperature Forecasting Assisted with Meta-Learning for Hyperparameter Optimization
    Tran, Trang Thi Kieu
    Lee, Taesam
    Shin, Ju-Young
    Kim, Jong-Suk
    Kamruzzaman, Mohamad
    ATMOSPHERE, 2020, 11 (05)
  • [45] An Agile Container-based Approach to TaaS
    Verdugo, Pedro
    Salvachua, Joaquin
    Huecas, Gabriel
    2017 56TH FITCE CONGRESS, 2017, : 10 - 15
  • [46] Research on Cross-Media Retrieval of Collaborative Plotted Multimedia Data Based on Container-Based Cloud Platform and Deep Learning
    Xie, Xiaolan
    Zheng, Qiangqing
    Li, Xinrong
    Cheng, Xiaochun
    Guo, Zhihong
    COMPUTER SUPPORTED COOPERATIVE WORK AND SOCIAL COMPUTING, CHINESECSCW 2018, 2019, 917 : 410 - 423
  • [47] Container-based virtual elastic clusters
    de Alfonso, Carlos
    Calatrava, Amanda
    Molto, German
    JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 127 : 1 - 11
  • [48] Container-Based Platform for Computational Medicine
    Pezzullo, Gennaro, Jr.
    Di Martino, Beniamino
    Bubak, Marian
    ADVANCED INFORMATION NETWORKING AND APPLICATIONS, AINA-2022, VOL 3, 2022, 451 : 131 - 140
  • [49] Container-based Video Streaming Service
    Vidiecan, Matus
    Bobak, Martin
    2022 IEEE 22ND INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND INFORMATICS AND 8TH IEEE INTERNATIONAL CONFERENCE ON RECENT ACHIEVEMENTS IN MECHATRONICS, AUTOMATION, COMPUTER SCIENCE AND ROBOTICS (CINTI-MACRO), 2022, : 191 - 196
  • [50] DSEOM: A Framework for Dynamic Security Evaluation and Optimization of MTD in Container-Based Cloud
    Jin, Hai
    Li, Zhi
    Zou, Deqing
    Yuan, Bin
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2021, 18 (03) : 1125 - 1136