Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms

Cited by: 15
Authors
Cheng, Daning [1 ]
Li, Shigang [2 ]
Zhang, Hanping [3 ]
Xia, Fen [3 ]
Zhang, Yunquan [4 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, SKL, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland
[3] Beijing Wisdom Uranium Technol Co Ltd, Algorithm Dept, Beijing, Peoples R China
[4] Chinese Acad Sci, Inst Comp Technol, SKL Comp Architecture, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Training; Scalability; Machine learning; Machine learning algorithms; Stochastic processes; Task analysis; Upper bound; Parallel training algorithms; training dataset; scalability; stochastic optimization methods;
DOI
10.1109/TPDS.2020.3048836
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
As the training dataset size and the model size of machine learning increase rapidly, more computing resources are consumed to speed up the training process. However, the scalability and performance reproducibility of parallel machine learning training, which mainly uses stochastic optimization algorithms, are limited. In this paper, we demonstrate that the sample difference in the dataset plays a prominent role in the scalability of parallel machine learning algorithms. We propose to use statistical properties of the dataset to measure sample differences. These properties include the variance of sample features, sample sparsity, sample diversity, and similarity in sampling sequences. We choose four types of parallel training algorithms as our research objects: (1) the asynchronous parallel SGD algorithm (Hogwild! algorithm), (2) the parallel model average SGD algorithm (minibatch SGD algorithm), (3) the decentralization optimization algorithm, and (4) the dual coordinate optimization (DADM algorithm). Our results show that the statistical properties of training datasets determine the scalability upper bound of these parallel training algorithms.
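The abstract names several dataset statistics without defining them. As a rough illustration only, the sketch below computes two of them, per-feature variance and sparsity, using common textbook definitions (population variance per column, fraction of exactly-zero entries). These definitions, the function names, and the toy data are assumptions made here and may differ from the measures used in the paper.

```python
from statistics import pvariance


def feature_variances(samples):
    """Population variance of each feature column across all samples."""
    return [pvariance(column) for column in zip(*samples)]


def sparsity(samples):
    """Fraction of entries in the dataset that are exactly zero."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for value in row if value == 0)
    return zeros / total


# Hypothetical toy dataset: 4 samples with 3 features each.
toy = [
    [1.0, 0.0, 0.0],
    [3.0, 0.0, 2.0],
    [1.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
]

variances = feature_variances(toy)   # one variance per feature column
zero_fraction = sparsity(toy)        # share of zero-valued entries
```

Sample diversity and similarity in sampling sequences are left out because the abstract gives no hint of how they are measured.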
Pages: 1702-1712
Number of pages: 11