GuardGrid: a high-availability cloud platform for deep learning applications

Cited by: 0
Authors
Yifan Sui [1 ]
Meng Cai [2 ]
Jianxun Li [1 ]
Affiliations
[1] Department of Automation, Shanghai Jiao Tong University, Shanghai
[2] Luoyang Institute of Electro-Optical Equipment, AVIC, Luoyang
Funding
National Natural Science Foundation of China
Keywords
Cloud computing; Fault-tolerant; High availability; Machine learning system;
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training applications places enormous pressure on nodes, easily causing node failures. Although many works address fast recovery from failure, they fail to achieve optimal recovery speed because they ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions may intensify the out-of-memory (OOM) issue. Moreover, they focus only on accelerating node initialization, ignoring the dependency-library, dataset, and model loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes ahead of future failures, based on both fault-rate prediction and the servers' memory load. In addition, to speed up recovery and avoid introducing OOM issues through redundant nodes, we propose a reactive fault-tolerant mechanism that takes over when the cluster's memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
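The abstract describes two cooperating mechanisms: a proactive one that pre-creates redundant nodes when failures are predicted and memory headroom allows, and a reactive one that takes over under high memory load so that redundant nodes do not themselves trigger OOM. The following is a minimal illustrative sketch of that decision logic only; all function names, thresholds, and the specific policy shape are assumptions for illustration, not the paper's actual algorithm.

```python
# Hypothetical sketch of GuardGrid-style mode selection, as summarized
# in the abstract. Thresholds and names are illustrative assumptions.

def choose_strategy(predicted_fault_rate: float,
                    memory_load: float,
                    fault_threshold: float = 0.3,
                    memory_threshold: float = 0.8) -> str:
    """Pick a fault-tolerance mode for the cluster.

    Proactive mode spawns redundant standby nodes before failures occur,
    but only while memory headroom allows; under high memory pressure,
    extra standby nodes would worsen the OOM problem, so the cluster
    falls back to a reactive mode that recovers after a failure.
    """
    if memory_load >= memory_threshold:
        # High memory pressure: redundant nodes risk OOM; react on failure.
        return "reactive"
    if predicted_fault_rate >= fault_threshold:
        # Failures likely and headroom available: pre-create standby nodes.
        return "proactive"
    # Low predicted fault rate: no standby nodes needed.
    return "none"
```

For example, a cluster at 90% memory load would be handled reactively regardless of the predicted fault rate, whereas one at 50% load with a high predicted fault rate would get standby nodes in advance.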