DEEPVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Cited by: 0

Authors:

Kim, Yoochan [1]

Kim, Kihyun [1]

Cho, Yonghyeon [1,4]

Kim, Jinwoo [1]

Khan, Awais [2]

Kang, Ki-Dong [3]

An, Baik-Song [3]

Cha, Myung-Hoon [3]

Kim, Hong-Yeon [3]

Kim, Youngjae [1]

Affiliations:

[1] Sogang Univ, Dept Comp Sci & Engn, Seoul, South Korea

[2] Oak Ridge Natl Lab, Oak Ridge, TN USA

[3] ETRI, Daejeon, South Korea

[4] LG Elect, Seoul, South Korea

Source:

2024 IEEE 24TH INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING, CCGRID 2024 | 2024

Keywords:

Cloud Computing; Distributed Deep Learning; Checkpoint-Restart;

DOI:

10.1109/CCGrid59990.2024.00034

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DEEPVM, a novel solution that recommends cost-effective cluster configurations by intelligently balancing the use of Spot and On-Demand VMs. DEEPVM leverages a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for the user-specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DEEPVM consistently outperforms other policies, reducing training costs and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DEEPVM opens up DDL to a wider range of users and facilitates a more efficient training of complex DNNs.
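The abstract names FLOPP (FLoating-point Operations Per Price) as the metric used to analyze instance performance, but this record does not give its exact definition. The following is a minimal sketch that assumes FLOPP is peak throughput divided by hourly price; the instance names, throughput figures, and prices are hypothetical placeholders, not values from the paper or from AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str              # hypothetical label, not a real AWS instance type
    tflops: float          # assumed peak throughput in TFLOP/s
    price_per_hour: float  # assumed hourly price in USD


def flopp(inst: Instance) -> float:
    """Assumed FLOPP: tera floating-point operations bought per dollar.

    (TFLOP/s * 3600 s/h) / (USD/h) = TFLOP per USD.
    """
    return inst.tflops * 3600.0 / inst.price_per_hour


def rank_by_flopp(instances):
    """Return instances sorted from most to least cost-efficient."""
    return sorted(instances, key=flopp, reverse=True)


# Hypothetical candidates: same hardware offered as Spot and On-Demand,
# plus a smaller, cheaper Spot option.
candidates = [
    Instance("spot-a", 100.0, 1.0),
    Instance("ondemand-a", 100.0, 4.0),
    Instance("spot-b", 50.0, 0.25),
]

ranked = rank_by_flopp(candidates)
for inst in ranked:
    print(f"{inst.name}: {flopp(inst):.0f} TFLOP per USD")
```

A ranking like this covers only the per-instance analysis step. The linear-programming stage and the handling of Spot interruptions described in the abstract are not modeled here.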

Citation

Pages: 227-235

Page count: 9

4 items in total:

[1] Systematic Literature Review on Cost-Efficient Deep Learning
Klemetti, Antti; Raatikainen, Mikko; Myllyaho, Lalli; Mikkonen, Tommi; Nurminen, Jukka K.
IEEE ACCESS, 2023, 11: 90158-90180

[2] Performance and Cost-Efficient Spark Job Scheduling Based on Deep Reinforcement Learning in Cloud Computing Environments
Islam, Muhammed Tawfiqul; Karunasekera, Shanika; Buyya, Rajkumar
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (07): 1695-1710

[3] Self-managed cost-efficient virtual elastic clusters on hybrid Cloud infrastructures
Calatrava, Amanda; Romero, Eloy; Molto, German; Caballer, Miguel; Miguel Alonso, Jose
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2016, 61: 13-25

[4] Optimizing Multi-Level Checkpointing for Distributed Deep Learning Workloads on Cloud Spot VM Clusters
Cho, Yonghyeon; Kim, Yoochan; Kim, Kihyun; Kim, Jinwoo; Kim, Hong-Yeon; Kim, Youngjae
IEEE ACCESS, 2024, 12: 116891-116904
