Accelerating Containerized Machine Learning Workloads

Authors
Tariq, Ali [1, 2]
Cao, Lianjie [2 ]
Ahmed, Faraz [2 ]
Rozner, Eric [1 ]
Sharma, Puneet [2 ]
Affiliations
[1] Univ Colorado, Boulder, CO 80309 USA
[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA
Keywords
Machine Learning; Cloud Computing; Resource virtualization and management
DOI
10.1109/NOMS59830.2024.10575188
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
To support Machine Learning (ML) training and inference tasks, enterprises tend to build large, expensive clusters and share them among different teams running diverse ML workloads. Virtualized platforms (containers/VMs) and schedulers are typically deployed to provide such access, manage heterogeneous resources, and schedule ML jobs in these clusters. However, allocating resource budgets to different ML jobs so as to achieve the best performance and cluster resource efficiency remains a significant challenge. This work proposes NEARCHUS, which accelerates distributed ML training while ensuring high resource efficiency through adaptive resource allocation. NEARCHUS automatically identifies potential performance bottlenecks in running jobs and re-allocates resources to provide optimized run-time performance with high resource efficiency. NEARCHUS's resource configuration improves the training speed of individual jobs by up to 71.4%-129.1% against state-of-the-art resource schedulers, and reduces job completion time and queuing time by 35.6% and 67.8%, respectively.
Pages: 10