GuardGrid: a high-availability cloud platform for deep learning applications

Cited by: 0
Authors
Yifan Sui [1 ]
Meng Cai [2 ]
Jianxun Li [1 ]
Affiliations
[1] Department of Automation, Shanghai Jiao Tong University, Shanghai
[2] Luoyang Institute of Electro-Optical Equipment, AVIC, Luoyang
Funding
National Natural Science Foundation of China
Keywords
Cloud computing; Fault-tolerant; High availability; Machine learning system
DOI
10.1007/s10586-024-04959-6
Abstract
With the development of cloud computing, training machine learning (ML) models on the cloud has become a hot topic. However, the memory-intensive nature of ML training places enormous pressure on nodes and easily causes node failures. Although many works address fast recovery from failure, they fail to achieve optimal recovery speed because they ignore the unique characteristics of ML training. We observed that existing fault-tolerant solutions can even intensify the out-of-memory (OOM) problem. Moreover, they focus only on accelerating node initialization, ignoring the dependency-library, dataset, and model-loading stages, which take much longer than node initialization. In this paper, we propose GuardGrid, a fault-tolerant cloud platform that effectively avoids OOM issues and accelerates the recovery of ML training tasks. It contains a proactive fault-tolerant mechanism that creates redundant nodes in advance of future failures, based on both fault-rate prediction and the servers' memory load. In addition, to speed up recovery without letting the redundant nodes themselves introduce OOM issues, we propose a reactive fault-tolerant mechanism that takes over when the cluster's memory load is high. Extensive experiments show that GuardGrid accelerates recovery by up to 16.7× and reduces the OOM rate by up to 93% compared with state-of-the-art methods. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
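The abstract describes a two-mode design: proactive redundancy when a failure is predicted and memory headroom allows, and a reactive fallback when the cluster's memory load is high. The following is a minimal sketch of that switching logic only, under assumed names: ClusterState, MEM_THRESHOLD, FAULT_THRESHOLD, and choose_strategy are all hypothetical illustrations, not GuardGrid's actual implementation or API.

```python
# Hypothetical sketch of the proactive/reactive switch the abstract
# describes. All names and thresholds here are assumptions for
# illustration; the paper's actual algorithm is not reproduced.

from dataclasses import dataclass


@dataclass
class ClusterState:
    predicted_fault_rate: float  # e.g., from a fault-rate predictor, in [0, 1]
    memory_load: float           # fraction of cluster memory in use, in [0, 1]


MEM_THRESHOLD = 0.8    # above this, a redundant node would itself risk OOM
FAULT_THRESHOLD = 0.3  # above this, a failure is likely enough to pre-provision


def choose_strategy(state: ClusterState) -> str:
    """Pick a fault-tolerance mode for the next scheduling interval."""
    if state.memory_load >= MEM_THRESHOLD:
        # Memory is tight: pre-created redundant nodes would intensify
        # OOM, so fall back to reactive recovery after a failure occurs.
        return "reactive"
    if state.predicted_fault_rate >= FAULT_THRESHOLD:
        # Headroom exists and a fault is likely: create a warm redundant
        # node now, with dependency libraries, dataset, and model already
        # loaded, so recovery skips those slow loading stages.
        return "proactive"
    return "none"  # cheap default: no standby node this interval


if __name__ == "__main__":
    print(choose_strategy(ClusterState(0.5, 0.9)))  # -> reactive
    print(choose_strategy(ClusterState(0.5, 0.4)))  # -> proactive
```

The point of the split is visible in the first branch: the proactive path is only taken when spare memory exists, which is how the design avoids having the standby nodes themselves trigger the OOM failures they are meant to guard against.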