Efficient Scaling on GPU for Federated Learning in Kubernetes: A Reinforcement Learning Approach

Authors
Bak, Charn-Doh [1]
Han, Seung-Jae [1]
Affiliations
[1] Yonsei Univ, Dept Comp Sci & Engn, Seoul, South Korea
Source
2024 INTERNATIONAL TECHNICAL CONFERENCE ON CIRCUITS/SYSTEMS, COMPUTERS, AND COMMUNICATIONS, ITC-CSCC 2024 | 2024
Keywords
Federated learning; Kubernetes cluster; GPU scaling; Resource efficiency
DOI
10.1109/ITC-CSCC62988.2024.10628145
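Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract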
Distributed learning enables efficient training on large-scale data by allowing processing to occur in multiple locations without centralization. However, the "straggler problem", in which updates are delayed due to performance heterogeneity, hinders efficient learning. GPUs are crucial for enhancing learning speed, since their parallel computing capabilities process large-scale data efficiently. Efficient GPU utilization that accounts for resource costs is essential for reducing training time and cost. Meanwhile, Kubernetes offers scalability, ease of management, and resource provisioning for efficient operations. In this paper, we propose a Kubernetes-based vertical scaling scheme to address the straggler problem in federated learning (FL). For GPU scaling, we formulate an optimization problem that considers both resource cost and learning speed. Then, we propose a deep reinforcement learning (DRL)-based approach to solve this problem. We also leverage various Kubernetes features as well as CUDA Multi-Process Service (MPS) to execute vertical scaling. We validate the proposed scheme's performance through various evaluations on a real testbed.
Pages: 6