Accelerating Containerized Machine Learning Workloads

Authors
Tariq, Ali [1, 2]
Cao, Lianjie [2 ]
Ahmed, Faraz [2 ]
Rozner, Eric [1 ]
Sharma, Puneet [2 ]
Affiliations
[1] Univ Colorado, Boulder, CO 80309 USA
[2] Hewlett Packard Labs, Palo Alto, CA 94304 USA
Keywords
Machine Learning; Cloud Computing; Resource virtualization and management
DOI
10.1109/NOMS59830.2024.10575188
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
To facilitate various Machine Learning (ML) training and inference tasks, enterprises tend to build large and expensive clusters and share them among different teams for diverse ML workloads. Virtualized platforms (containers/VMs) and schedulers are typically deployed to allow such access, manage heterogeneous resources, and schedule ML jobs in these clusters. However, allocating resource budgets for different ML jobs to achieve the best performance and cluster resource efficiency remains a significant challenge. This work proposes NEARCHUS to accelerate distributed ML training while ensuring high resource efficiency by using adaptive resource allocation. NEARCHUS automatically identifies potential performance bottlenecks for running jobs and re-allocates resources to provide optimized run-time performance with high resource efficiency. NEARCHUS's resource configuration improves the training speed of individual jobs by up to 71.4%-129.1% compared with state-of-the-art resource schedulers, and reduces job completion time and queuing time by 35.6% and 67.8%, respectively.
Pages: 10