Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Cited by: 15

Authors:

Cheng, Daning [1]

Li, Shigang [2]

Zhang, Hanping [3]

Xia, Fen [3]

Zhang, Yunquan [4]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, SKL, Beijing, Peoples R China

[2] Swiss Fed Inst Technol, Dept Comp Sci, Zh, Switzerland

[3] Beijing Wisdom Uranium Technol Co Ltd, Algorithm Dept, Beijing, Peoples R China

[4] Chinese Acad Sci, Inst Comp Technol, SKL Comp Architecture, Beijing, Peoples R China

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2021 / Vol. 32 / Issue 07

Funding:

National Natural Science Foundation of China;

Keywords:

Training; Scalability; Machine learning; Machine learning algorithms; Stochastic processes; Task analysis; Upper bound; Parallel training algorithms; training dataset; scalability; stochastic optimization methods;

DOI:

10.1109/TPDS.2020.3048836

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralization optimization algorithm, and (4) the dual coordinate optimization (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
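Two of the dataset properties named in the abstract, the variance of sample features and sample sparsity, can be illustrated with a minimal sketch. This is an illustration only: the abstract does not give the paper's exact definitions, so the function name `dataset_statistics`, the use of population variance per feature, and sparsity measured as the fraction of zero entries are all assumptions made here.

```python
from statistics import pvariance


def dataset_statistics(samples):
    """Summarize a dense dataset given as a list of equal-length feature rows.

    Assumed definitions (not taken from the paper):
      - feature variance: population variance of each feature column,
        averaged over all features
      - sparsity: fraction of entries that are exactly zero
    """
    n_features = len(samples[0])
    # Transpose rows into per-feature columns.
    columns = [[row[j] for row in samples] for j in range(n_features)]
    variances = [pvariance(col) for col in columns]
    total_entries = len(samples) * n_features
    zero_entries = sum(1 for row in samples for value in row if value == 0)
    return {
        "mean_feature_variance": sum(variances) / n_features,
        "sparsity": zero_entries / total_entries,
    }


# Toy dataset: two samples with three features each.
toy_samples = [
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 2.0],
]
stats = dataset_statistics(toy_samples)
print(stats)
```

Statistics of this kind could be computed once before training and compared across datasets; how they relate to the scalability bounds is the subject of the paper itself and is not reproduced here.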


Pages: 1702 - 1712

Number of pages: 11
