Accelerating Containerized Machine Learning Workloads

Cited by: 0

Authors:

Tariq, Ali [1,2]

Cao, Lianjie [2]

Ahmed, Faraz [2]

Rozner, Eric [1]

Sharma, Puneet [2]

Affiliations:

[1] Univ Colorado, Boulder, CO 80309 USA

[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA

Source:

PROCEEDINGS OF 2024 IEEE/IFIP NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM, NOMS 2024 | 2024

Keywords:

Machine Learning; Cloud Computing; Resource virtualization and management

DOI:

10.1109/NOMS59830.2024.10575188

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

To support various Machine Learning (ML) training and inference tasks, enterprises tend to build large, expensive clusters and share them among different teams running diverse ML workloads. Virtualized platforms (containers/VMs) and schedulers are typically deployed to provide such access, manage heterogeneous resources, and schedule ML jobs in these clusters. However, allocating resource budgets to different ML jobs so as to achieve the best performance and cluster resource efficiency remains a significant challenge. This work proposes NEARCHUS, which accelerates distributed ML training while ensuring high resource efficiency through adaptive resource allocation. NEARCHUS automatically identifies potential performance bottlenecks in running jobs and re-allocates resources to provide optimized run-time performance with high resource efficiency. NEARCHUS's resource configuration significantly improves the training speed of individual jobs, by up to 71.4%-129.1% compared with state-of-the-art resource schedulers, and reduces job completion time and queuing time by 35.6% and 67.8%, respectively.

Pages: 10
