FedArtML: A Tool to Facilitate the Generation of Non-IID Datasets in a Controlled Way to Support Federated Learning Research

Cited by: 1
Authors
Gutierrez, Daniel Mauricio Jimenez [1]
Anagnostopoulos, Aris [1]
Chatzigiannakis, Ioannis [1]
Vitaletti, Andrea [1]
Affiliations
[1] Sapienza Univ Rome, Dept Comp Control & Management Engn, I-00185 Rome, Italy
Source
IEEE ACCESS | 2024 / Vol. 12
Keywords
Data models; Measurement; Training; Data privacy; Systematics; Federated learning; Distributed databases; Machine learning; Centralized datasets; client's heterogeneity; federated datasets; federated learning; heterogeneity metrics; machine learning; non-IID-ness;
DOI
10.1109/ACCESS.2024.3410026
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Federated Learning (FL) enables collaborative training of Machine Learning (ML) models across decentralized clients while preserving data privacy. One challenge for FL arises when the clients' data are not independent and identically distributed (non-IID). It is therefore crucial to quantify how non-IID data affect performance. However, because few federated datasets are available, real-world simulations are difficult to carry out. In this work, we propose for the first time 1) the Hist-Dirichlet-based and Min-Size-Dirichlet methods, which partition data across multiple nodes according to the feature and quantity distributions using the Dirichlet distribution. We use 2) the Jensen-Shannon and Hellinger distances to quantify the degree of non-IID-ness of the data. We also implement 3) state-of-the-art partitioning methods based on the distribution of labels across clients. All our proposals are released as open source in a library called FedArtML, publicly available on PyPI. The library facilitates research on cross-silo and cross-device FL by allowing a systematic and controlled partition of centralized datasets according to label, feature, and quantity skewness. To demonstrate the value of our proposed methods and the robustness of FedArtML, we ran experiments on ECG arrhythmia detection with PhysioNet 2020 data. Our results show that the tool generates federated datasets for multi-client model training and accurately measures the heterogeneity of client distributions. Our approach achieves 48% higher non-IID-ness than existing feature-skew methods, providing more granularity. Furthermore, we validate our simulated federated datasets against real-world data, finding only a 2% F1-score difference, which supports the method's real-life applicability.
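The two ideas named in the abstract, Dirichlet-based partitioning and distance-based measurement of non-IID-ness, can be illustrated with a minimal NumPy sketch. This is not the FedArtML API: the function names (`dirichlet_label_partition`, `jensen_shannon_distance`, `hellinger_distance`, `label_distribution`) and the parameters `alpha`, `n_clients`, and `seed` are hypothetical, and the sketch covers only the commonly used label-skew variant, not the Hist-Dirichlet or Min-Size-Dirichlet methods. Comparing each client's label distribution with the global one is likewise an assumption about how the distances might be applied.

```python
import numpy as np


def dirichlet_label_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet label skew.

    For each class, the clients' shares are drawn from Dirichlet(alpha);
    a smaller alpha yields a more skewed (more non-IID) split.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        shares = rng.dirichlet(alpha * np.ones(n_clients))
        # Cut points that turn the shares into contiguous index chunks.
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return client_indices


def label_distribution(labels, classes):
    """Relative frequency of each class in a non-empty label array."""
    labels = np.asarray(labels)
    counts = np.array([(labels == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


# Toy usage: 1,000 samples, 4 classes, 5 clients, strong skew (alpha = 0.1).
toy_rng = np.random.default_rng(42)
toy_labels = toy_rng.integers(0, 4, size=1000)
toy_classes = np.unique(toy_labels)
parts = dirichlet_label_partition(toy_labels, n_clients=5, alpha=0.1)
global_dist = label_distribution(toy_labels, toy_classes)
for client, idx in enumerate(parts):
    if not idx:  # a client may receive no samples when alpha is small
        continue
    local = label_distribution(toy_labels[idx], toy_classes)
    print(client, len(idx),
          round(jensen_shannon_distance(local, global_dist), 3),
          round(hellinger_distance(local, global_dist), 3))
```

Both distances are 0 for identical distributions and 1 for distributions with disjoint support, which makes them convenient as bounded heterogeneity scores; lowering `alpha` in the sketch should push the per-client values toward 1.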
Pages: 81004-81016
Page count: 13