Efficient Scaling on GPU for Federated Learning in Kubernetes: A Reinforcement Learning Approach

Times cited: 0

Authors:

Bak, Charn-Doh [1]

Han, Seung-Jae [1]

Affiliation:

[1] Yonsei Univ, Dept Comp Sci & Engn, Seoul, South Korea

Source:

2024 INTERNATIONAL TECHNICAL CONFERENCE ON CIRCUITS/SYSTEMS, COMPUTERS, AND COMMUNICATIONS, ITC-CSCC 2024 | 2024

Keywords:

Federated learning; Kubernetes cluster; GPU scaling; Resource efficiency

DOI:

10.1109/ITC-CSCC62988.2024.10628145

Chinese Library Classification:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification codes:

0808; 0809

Abstract:

Distributed learning enables efficient training on large-scale data by letting processing occur in multiple locations without centralization. However, the "straggler problem", in which performance heterogeneity delays some participants' updates, hinders efficient learning. GPUs are crucial for increasing learning speed because their parallel computing capabilities process large-scale data efficiently. Efficient GPU utilization that accounts for resource cost is essential for reducing both training time and expense. Meanwhile, Kubernetes offers scalability, ease of management, and resource provisioning for efficient operations. In this paper, we propose a Kubernetes-based vertical scaling scheme to address the straggler problem in federated learning (FL). For GPU scaling, we formulate an optimization problem that considers both resource cost and learning speed. We then propose an approach based on deep reinforcement learning (DRL) to solve this problem. We also leverage various Kubernetes features as well as CUDA Multi-Process Service (MPS) to execute vertical scaling. We validate the proposed scheme's performance through various evaluations on a real testbed.
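The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's formulation, which this record does not give: the weighted-sum objective, the `cost_weight` parameter, the GPU share levels, and the per-client workloads are all hypothetical, and an exhaustive search stands in for the DRL agent. It only shows why giving the straggler a larger GPU share can shorten a synchronous round without raising total GPU cost.

```python
from itertools import product


def round_time(work, gpu_share):
    """Synchronous round time: the slowest client (the straggler) sets the pace."""
    return max(w / s for w, s in zip(work, gpu_share))


def objective(work, gpu_share, cost_weight):
    """Hypothetical objective: round time plus weighted total GPU share."""
    return round_time(work, gpu_share) + cost_weight * sum(gpu_share)


def best_allocation(work, levels, cost_weight):
    """Exhaustive search over per-client share levels (a stand-in for the DRL agent)."""
    return min(
        product(levels, repeat=len(work)),
        key=lambda shares: objective(work, shares, cost_weight),
    )


# Hypothetical workloads: the third client has four times the work per round.
work = [10.0, 10.0, 40.0]
levels = [0.25, 0.5, 1.0]      # assumed GPU fractions a client can be given
uniform = (0.5, 0.5, 0.5)      # baseline: every client gets the same share

best = best_allocation(work, levels, cost_weight=10.0)
print("uniform:", round_time(work, uniform), "best:", best, round_time(work, best))
```

With these made-up numbers, the search moves GPU share from the two light clients to the heavy one. Total share stays at 1.5, while the round time drops because no client lags behind the others.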


Pages: 6
