Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Cited by: 15

Authors:

Cheng, Daning [1]

Li, Shigang [2]

Zhang, Hanping [3]

Xia, Fen [3]

Zhang, Yunquan [4]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, SKL, Beijing, Peoples R China

[2] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

[3] Beijing Wisdom Uranium Technol Co Ltd, Algorithm Dept, Beijing, Peoples R China

[4] Chinese Acad Sci, Inst Comp Technol, SKL Comp Architecture, Beijing, Peoples R China

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2021 / Volume 32 / Issue 07

Funding:

National Natural Science Foundation of China;

Keywords:

Training; Scalability; Machine learning; Machine learning algorithms; Stochastic processes; Task analysis; Upper bound; Parallel training algorithms; training dataset; scalability; stochastic optimization methods;

DOI:

10.1109/TPDS.2020.3048836

Chinese Library Classification number:

TP301 [Theory and methods];

Discipline classification code:

081202 ;

Abstract:

As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralization optimization algorithm, and (4) the dual coordinate optimization (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
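
Two of the dataset properties named in the abstract, feature variance and sample sparsity, can be illustrated with a minimal sketch. This record does not give the paper's exact definitions, so the code assumes generic ones: per-feature population variance and the fraction of zero-valued entries. The function names `feature_variances` and `sparsity` and the toy data are hypothetical and chosen only for illustration.

```python
from statistics import pvariance


def feature_variances(samples):
    """Population variance of each feature (column) across all samples.

    Assumed definition; the paper may normalize or aggregate differently.
    """
    columns = zip(*samples)  # transpose rows (samples) into columns (features)
    return [pvariance(column) for column in columns]


def sparsity(samples):
    """Fraction of entries that are exactly zero (assumed definition)."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for value in row if value == 0)
    return zeros / total


# Hypothetical toy dataset: 4 samples, 3 features.
toy = [
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 2.0],
    [5.0, 0.0, 0.0],
    [7.0, 0.0, 2.0],
]

variances = feature_variances(toy)
zero_fraction = sparsity(toy)
print(variances, zero_fraction)
```

In this toy data the first feature varies most and the second is constant, while half of all entries are zero. Statistics of this kind are what the abstract proposes as measures of sample difference.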

Pages: 1702-1712

Page count: 11
