Accelerating Containerized Machine Learning Workloads

Cited by: 0

Authors:

Tariq, Ali [1,2]

Cao, Lianjie [2]

Ahmed, Faraz [2]

Rozner, Eric [1]

Sharma, Puneet [2]

Affiliations:

[1] Univ Colorado, Boulder, CO 80309 USA

[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA

Source:

PROCEEDINGS OF 2024 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, NOMS 2024 | 2024

Keywords:

Machine Learning; Cloud Computing; Resource virtualization and management

DOI:

10.1109/NOMS59830.2024.10575188

Chinese Library Classification (CLC):

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

To facilitate various Machine Learning (ML) training and inference tasks, enterprises tend to build large and expensive clusters and share them among different teams for diverse ML workloads. Virtualized platforms (containers/VMs) and schedulers are typically deployed to allow such access, manage heterogeneous resources, and schedule ML jobs in these clusters. However, allocating resource budgets for different ML jobs to achieve the best performance and cluster resource efficiency remains a significant challenge. This work proposes NEARCHUS to accelerate distributed ML training while ensuring high resource efficiency by using adaptive resource allocation. NEARCHUS automatically identifies potential performance bottlenecks for running jobs and re-allocates resources to provide optimized run-time performance with high resource efficiency. NEARCHUS's resource configuration significantly improves the training speed of individual jobs by up to 71.4%-129.1% against state-of-the-art resource schedulers, and reduces job completion and queuing time by 35.6% and 67.8%, respectively.


Pages: 10
