DEEPVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Authors
Kim, Yoochan [1]
Kim, Kihyun [1]
Cho, Yonghyeon [1, 4]
Kim, Jinwoo [1]
Khan, Awais [2]
Kang, Ki-Dong [3]
An, Baik-Song [3]
Cha, Myung-Hoon [3]
Kim, Hong-Yeon [3]
Kim, Youngjae [1]
Affiliations
[1] Sogang Univ, Dept Comp Sci & Engn, Seoul, South Korea
[2] Oak Ridge Natl Lab, Oak Ridge, TN USA
[3] ETRI, Daejeon, South Korea
[4] LG Elect, Seoul, South Korea
Source
2024 IEEE 24TH INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING, CCGRID 2024 | 2024
Keywords
Cloud Computing; Distributed Deep Learning; Checkpoint-Restart
DOI
10.1109/CCGrid59990.2024.00034
Chinese Library Classification (CLC) number
TP39 [Computer Applications]
Discipline classification code
081203; 0835
Abstract
Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DEEPVM, a novel solution that recommends cost-effective cluster configurations by intelligently balancing the use of Spot and On-Demand VMs. DEEPVM leverages a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for the user-specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DEEPVM consistently outperforms other policies, reducing training costs and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DEEPVM opens up DDL to a wider range of users and facilitates a more efficient training of complex DNNs.
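The abstract names FLOPP (FLoating-point Operations Per Price) as the metric used to analyze instance performance but gives no formula. As a minimal sketch, assuming FLOPP is simply peak throughput divided by hourly price, candidate instances could be ranked as below. The `Instance` type, the instance names, and all throughput and price figures are hypothetical and are not taken from the paper or from AWS pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    tflops: float          # peak throughput in TFLOPS (hypothetical figure)
    price_per_hour: float  # price in USD per hour (hypothetical figure)


def flopp(inst: Instance) -> float:
    """Assumed definition: TFLOPS divided by the hourly price in USD."""
    if inst.price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return inst.tflops / inst.price_per_hour


def rank_by_flopp(instances):
    """Return the instances sorted from highest to lowest FLOPP."""
    return sorted(instances, key=flopp, reverse=True)


# Hypothetical catalog: the same hardware as On-Demand and Spot, plus a
# smaller Spot type.
catalog = [
    Instance("type-a-ondemand", 100.0, 4.0),
    Instance("type-a-spot", 100.0, 1.25),
    Instance("type-b-spot", 60.0, 1.0),
]

ranked = rank_by_flopp(catalog)
for inst in ranked:
    print(f"{inst.name}: {flopp(inst):.1f} TFLOPS per USD/hour")
```

A ranking like this covers only the price-performance side. The abstract also describes an architecture-level analysis with linear programming and accounts for Spot VM availability, neither of which this sketch attempts.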
Pages: 227-235
Page count: 9