Accelerating Containerized Machine Learning Workloads

Cited by: 0
Authors
Tariq, Ali [1 ,2 ]
Cao, Lianjie [2 ]
Ahmed, Faraz [2 ]
Rozner, Eric [1 ]
Sharma, Puneet [2 ]
Affiliations
[1] Univ Colorado, Boulder, CO 80309 USA
[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA
Keywords
Machine Learning; Cloud Computing; Resource Virtualization and Management
DOI
10.1109/NOMS59830.2024.10575188
CLC Number
TP3 [Computing Technology; Computer Technology]
Discipline Code
0812
Abstract
To support diverse Machine Learning (ML) training and inference tasks, enterprises tend to build large, expensive clusters and share them across teams. Virtualized platforms (containers/VMs) and schedulers are typically deployed to enable such sharing, manage heterogeneous resources, and schedule ML jobs in these clusters. However, allocating resource budgets to different ML jobs so as to achieve both the best per-job performance and high cluster resource efficiency remains a significant challenge. This work proposes NEARCHUS, which accelerates distributed ML training while maintaining high resource efficiency through adaptive resource allocation. NEARCHUS automatically identifies potential performance bottlenecks in running jobs and re-allocates resources to optimize run-time performance without sacrificing resource efficiency. NEARCHUS's resource configurations improve the training speed of individual jobs by 71.4%-129.1% over state-of-the-art resource schedulers, and reduce job completion and queuing times by 35.6% and 67.8%, respectively.
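The record contains no code; the sketch below is a hypothetical illustration of the adaptive-allocation idea described in the abstract, not NEARCHUS's actual algorithm. It assumes a simple policy: monitor per-job resource usage, flag jobs whose GPU is starved because their CPU budget is saturated, reclaim slack from over-provisioned jobs, and shift the freed budget to the bottlenecked ones. All job fields, thresholds, and the reallocation rule are illustrative assumptions.

```python
# Hypothetical sketch of adaptive resource re-allocation for containerized ML
# training jobs, in the spirit of the abstract. Thresholds, job fields, and the
# reallocation policy are illustrative assumptions, not the NEARCHUS design.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cpu_alloc: float      # allocated CPU cores
    cpu_used: float       # observed CPU usage (cores)
    gpu_util: float       # observed GPU utilization, 0.0-1.0

def bottleneck(job: Job) -> str | None:
    """Classify a job as CPU-bound, over-provisioned, or neither."""
    cpu_saturated = job.cpu_used >= 0.9 * job.cpu_alloc
    gpu_underused = job.gpu_util < 0.6
    if cpu_saturated and gpu_underused:
        return "cpu"            # input pipeline likely CPU-bound, GPU starved
    if not cpu_saturated and job.cpu_alloc > job.cpu_used + 2:
        return "overprovisioned"
    return None

def reallocate(jobs: list[Job], cluster_cpu: float) -> None:
    """Shift spare CPU cores from over-provisioned jobs to CPU-bound ones."""
    spare = cluster_cpu - sum(j.cpu_alloc for j in jobs)
    for j in jobs:
        if bottleneck(j) == "overprovisioned":
            freed = (j.cpu_alloc - j.cpu_used) / 2   # reclaim half the slack
            j.cpu_alloc -= freed
            spare += freed
    for j in jobs:
        if bottleneck(j) == "cpu" and spare > 0:
            grant = min(2.0, spare)                  # grant up to 2 extra cores
            j.cpu_alloc += grant
            spare -= grant

if __name__ == "__main__":
    jobs = [
        Job("resnet-train", cpu_alloc=4, cpu_used=3.9, gpu_util=0.45),
        Job("bert-train",   cpu_alloc=8, cpu_used=3.0, gpu_util=0.85),
    ]
    reallocate(jobs, cluster_cpu=16)
    for j in jobs:
        print(f"{j.name}: {j.cpu_alloc:.1f} CPU cores")
```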
Pages: 10
Related Papers
50 records
  • [1] Merged Logic and Memory Fabrics for Accelerating Machine Learning Workloads
    Crafton, Brian
    Spetalnick, Samuel
    Fang, Yan
    Raychowdhury, Arijit
    IEEE DESIGN & TEST, 2021, 38 (01) : 39 - 68
  • [2] PIM-DRAM: Accelerating Machine Learning Workloads Using Processing in Commodity DRAM
    Roy, Sourjya
    Ali, Mustafa
    Raghunathan, Anand
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2021, 11 (04) : 701 - 710
  • [3] Rack Level Scheduling for Containerized Workloads
    Xu, Qiumin
    Malladi, Krishna T.
    Awasthi, Manu
    2017 INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE, AND STORAGE (NAS), 2017, : 286 - 287
  • [4] Adapting Containerized Workloads for the Continuum Computing
    Robles-Enciso, Alberto
    Skarmeta, Antonio F.
    IEEE ACCESS, 2024, 12 : 104102 - 104114
  • [5] Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads
    Zhou, Qinghua
    Anthony, Quentin
    Shafi, Aamir
    Subramoni, Hari
    Panda, Dhabaleswar K.
    2022 IEEE 29TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, HIPC, 2022, : 22 - 31
  • [6] Optimizing Machine Learning Workloads in Collaborative Environments
    Derakhshan, Behrouz
    Mahdiraji, Alireza Rezaei
    Abedjan, Ziawasch
    Rabl, Tilmann
    Markl, Volker
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020, : 1701 - 1716
  • [7] Accelerating Container-based Deep Learning Hyperparameter Optimization Workloads
    Liu, Rui
    Wong, David
    Lange, Dave
    Larsson, Patrik
    Jethava, Vinay
    Zheng, Qing
    PROCEEDINGS OF THE 6TH WORKSHOP ON DATA MANAGEMENT FOR END-TO-END MACHINE LEARNING, DEEM 2022, 2022,
  • [8] Characterizing and Balancing the Workloads of Semi-Containerized Clouds
    Zhao, Shang
    Xue, Shuai
    Chen, Quan
    Guo, Minyi
    2019 IEEE 25TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2019, : 145 - 148
  • [9] On-node resource manager for containerized HPC workloads
    Vallee, Geoffroy
    Arango Gutierrez, Carlos Eduardo
    Clerget, Cedric
    PROCEEDINGS OF CANOPIE-HPC 2019:2019 IEEE/ACM 1ST INTERNATIONAL WORKSHOP ON CONTAINERS AND NEW ORCHESTRATION PARADIGMS FOR ISOLATED ENVIRONMENTS IN HPC (CANOPIE-HPC), 2019, : 43 - 48
  • [10] Rule-based Security Monitoring of Containerized Workloads
    Gantikow, Holger
    Reich, Christoph
    Knahl, Martin
    Clarke, Nathan
    CLOSER: PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2019, : 543 - 550