Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Cited by: 15
Authors
Cheng, Daning [1 ]
Li, Shigang [2 ]
Zhang, Hanping [3 ]
Xia, Fen [3 ]
Zhang, Yunquan [4 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, SKL, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Dept Comp Sci, Zh, Switzerland
[3] Beijing Wisdom Uranium Technol Co Ltd, Algorithm Dept, Beijing, Peoples R China
[4] Chinese Acad Sci, Inst Comp Technol, SKL Comp Architecture, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Scalability; Machine learning; Machine learning algorithms; Stochastic processes; Task analysis; Upper bound; Parallel training algorithms; training dataset; scalability; stochastic optimization methods;
DOI
10.1109/TPDS.2020.3048836
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralized optimization algorithm, and (4) the dual coordinate optimization algorithm (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
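The abstract names several dataset statistics without defining them in this record. As a rough illustration only, the Python sketch below computes three simple stand-ins on a toy dataset: per-feature variance, the fraction of zero entries (for sparsity), and the fraction of distinct samples (for diversity). The function names, the toy data, and these particular definitions are assumptions made for illustration; they are not taken from the paper, whose exact measures may differ.

```python
from statistics import pvariance


def feature_variances(samples):
    """Population variance of each feature column across all samples."""
    return [pvariance(column) for column in zip(*samples)]


def sparsity(samples):
    """Fraction of entries that are exactly zero (assumed sparsity measure)."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for value in row if value == 0)
    return zeros / total


def distinct_fraction(samples):
    """Fraction of distinct samples: a crude, assumed proxy for diversity."""
    return len({tuple(row) for row in samples}) / len(samples)


# Hypothetical toy dataset: 4 samples with 3 features each.
toy = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 4.0],
]

print(feature_variances(toy))
print(sparsity(toy))
print(distinct_fraction(toy))
```

On this toy data, two of the four samples are identical and most entries are zero, which is the kind of sample difference the abstract relates to scalability.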
Pages: 1702 - 1712
Page count: 11
Related papers
50 in total
  • [41] Training and evaluating machine learning algorithms for ocean microplastics classification through vibrational spectroscopy
    Back, Henrique de Medeiros
    Junior, Edson Cilos Vargas
    Alarcon, Orestes Estevam
    Pottmaier, Daphiny
    CHEMOSPHERE, 2022, 287
  • [42] Reliability of probabilistic numerical data for training machine learning algorithms to detect damage in bridges
    Bud, Mihai Adrian
    Moldovan, Ionut
    Radu, Lucian
    Nedelcu, Mihai
    Figueiredo, Eloi
    STRUCTURAL CONTROL & HEALTH MONITORING, 2022, 29 (07)
  • [43] Machine learning assisted discovery of new thermoset shape memory polymers based on a small training dataset
    Yan, Cheng
    Feng, Xiaming
    Wick, Collin
    Peters, Andrew
    Li, Guoqiang
    POLYMER, 2021, 214
  • [44] A comparative study of a combinatorial machine learning approach to face detection using a very small training dataset
    Oyarzo Huichaqueo, Marco
    Magdaleno Maltas, Jordi
    2021 IEEE CHILEAN CONFERENCE ON ELECTRICAL, ELECTRONICS ENGINEERING, INFORMATION AND COMMUNICATION TECHNOLOGIES (IEEE CHILECON 2021), 2021, : 709 - 715
  • [45] Exploring the use of machine learning for interpreting electrochemical impedance spectroscopy data: evaluation of the training dataset size
    Bongiorno, V.
    Gibbon, S.
    Michailidou, E.
    Curioni, M.
    CORROSION SCIENCE, 2022, 198
  • [46] Assessing the diagnostic accuracy of machine learning algorithms for identification of asthma in United States adults based on NHANES dataset
    Gargari, Omid Kohandel
    Fathi, Mobina
    Firouzabadi, Shahryar Rajai
    Mohammadi, Ida
    Mahmoudi, Mohammad Hossein
    Sarmadi, Mehran
    Shafiee, Arman
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [47] A comparison of machine learning algorithms on design smell detection using balanced and imbalanced dataset: A study of God class
    Alkharabsheh, Khalid
    Alawadi, Sadi
    Kebande, Victor R.
    Crespo, Yania
    Fernandez-Delgado, Manuel
    Taboada, Jose A.
    INFORMATION AND SOFTWARE TECHNOLOGY, 2022, 143
  • [48] Reverse Engineering of Mechanical and Tribological Properties of Coatings: Results of Machine Learning Algorithms
    Pashkov, D. M.
    Belyak, O. A.
    Guda, A. A.
    Kolesnikov, V., I
    PHYSICAL MESOMECHANICS, 2022, 25 (04) : 296 - 305
  • [49] Why implementing machine learning algorithms in the clinic is not a plug-and-play solution: a simulation study of a machine learning algorithm for acute leukaemia subtype diagnosis
    Pucher, Gernot
    Rostalski, Till
    Nensa, Felix
    Kleesiek, Jens
    Reinhardt, Hans Christian
    Sauer, Christopher Martin
    EBIOMEDICINE, 2025, 111
  • [50] Optimizing machine learning algorithms for spatial prediction of gully erosion susceptibility with four training scenarios
    Liu, Guoqing
    Arabameri, Alireza
    Santosh, M.
    Nalivan, Omid Asadi
    ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH, 2023, 30 (16) : 46979 - 46996