Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Cited by: 15
Authors
Cheng, Daning [1 ]
Li, Shigang [2 ]
Zhang, Hanping [3 ]
Xia, Fen [3 ]
Zhang, Yunquan [4 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, SKL, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
[3] Beijing Wisdom Uranium Technol Co Ltd, Algorithm Dept, Beijing, Peoples R China
[4] Chinese Acad Sci, Inst Comp Technol, SKL Comp Architecture, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Scalability; Machine learning; Machine learning algorithms; Stochastic processes; Task analysis; Upper bound; Parallel training algorithms; training dataset; scalability; stochastic optimization methods;
DOI
10.1109/TPDS.2020.3048836
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
As the training dataset size and the model size in machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralization optimization algorithm, and (4) the dual coordinate optimization algorithm (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
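As a rough illustration of the kind of dataset statistics the abstract names, the sketch below computes a mean feature variance, a sparsity ratio, and a simple diversity proxy for a small dense dataset. This record does not give the paper's exact definitions, so the function name `dataset_statistics`, the returned keys, and the "fraction of distinct samples" diversity measure are all assumptions made for illustration only.

```python
from statistics import pvariance


def dataset_statistics(samples):
    """Summarize a dense dataset given as a list of equal-length rows.

    Hypothetical helper: these formulas are illustrative stand-ins,
    not the definitions used in the paper.
    """
    n_samples = len(samples)
    n_features = len(samples[0])

    # Transpose rows into per-feature columns.
    columns = list(zip(*samples))

    # Population variance of each feature, averaged over all features.
    mean_feature_variance = sum(pvariance(col) for col in columns) / n_features

    # Sparsity: fraction of entries that are exactly zero.
    zero_count = sum(1 for row in samples for value in row if value == 0)
    sparsity = zero_count / (n_samples * n_features)

    # Diversity proxy (assumption): fraction of rows that are distinct.
    diversity = len({tuple(row) for row in samples}) / n_samples

    return {
        "mean_feature_variance": mean_feature_variance,
        "sparsity": sparsity,
        "diversity": diversity,
    }


if __name__ == "__main__":
    toy = [[1, 0], [1, 0], [3, 0], [3, 4]]
    print(dataset_statistics(toy))
```

Statistics like these are cheap to compute before a training run, which is why they are plausible candidates for comparing datasets; how they relate to a scalability bound is the subject of the paper itself and is not reproduced here.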
Pages: 1702-1712
Page count: 11