An Adaptive Synchronous Parallel Strategy for Distributed Machine Learning

Cited by: 24
Authors
Zhang, Jilin [1 ,2 ,3 ,4 ,5 ]
Tu, Hangdi [1 ,2 ]
Ren, Yongjian [1 ,2 ]
Wan, Jian [1 ,2 ,4 ,5 ]
Zhou, Li [1 ,2 ]
Li, Mingwei [1 ,2 ]
Wang, Jue [6 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou 310018, Zhejiang, Peoples R China
[2] Minist Educ, Key Lab Complex Syst Modeling & Simulat, Hangzhou 310018, Zhejiang, Peoples R China
[3] Zhejiang Univ, Coll Elect Engn, Hangzhou 310058, Zhejiang, Peoples R China
[4] Zhejiang Univ Sci & Technol, Sch Informat & Elect Engn, Hangzhou 310023, Zhejiang, Peoples R China
[5] Zhejiang Prov Engn Ctr Media Data Cloud Proc & An, Hangzhou 310018, Zhejiang, Peoples R China
[6] Chinese Acad Sci, Supercomp Ctr Comp Network Informat Ctr, Beijing 100190, Peoples R China
Funding
National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program);
Keywords
Distributed machine learning; adaptive synchronous parallel; communication strategy; parameter server;
DOI
10.1109/ACCESS.2018.2820899
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
In recent years, distributed systems have become the dominant way to train machine learning (ML) models. However, because the computational nodes in a distributed cluster differ in performance and network transmission introduces delays, the accuracy and convergence rate of ML models suffer. It is therefore necessary to design a strategy that dynamically optimizes communication to improve cluster utilization, shorten training time, and strengthen the accuracy of the trained model. In this paper, we propose an adaptive synchronous parallel strategy for distributed ML. Using a performance-monitoring model, the synchronization strategy between each computational node and the parameter server is adjusted adaptively to exploit the full performance of each node, thereby ensuring higher accuracy. Furthermore, our strategy prevents the ML model from being affected by unrelated tasks running in the same cluster. Experiments show that our strategy improves cluster utilization, preserves model accuracy and convergence speed, increases training speed, and scales well.
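The abstract only sketches the mechanism, so the following toy Python sketch illustrates one plausible reading of it: a parameter server that monitors each worker's iteration time and adaptively relaxes the staleness bound for slower workers, so stragglers do not stall the rest of the cluster. All names (AdaptivePS, push_gradient, pull_params) and the EWMA monitoring rule are hypothetical illustrations, not the paper's actual algorithm.

# Minimal sketch, assuming a per-worker adaptive staleness bound driven by
# measured iteration times. Names and the monitoring rule are hypothetical.
import threading
import time
import random

class AdaptivePS:
    """Toy parameter server with a per-worker adaptive staleness bound."""

    def __init__(self, dim, num_workers, base_staleness=1, max_staleness=8):
        self.params = [0.0] * dim
        self.clock = [0] * num_workers            # logical clock per worker
        self.iter_time = [None] * num_workers     # moving avg of step time
        self.base_staleness = base_staleness
        self.max_staleness = max_staleness
        self.lock = threading.Condition()

    def _staleness_for(self, worker):
        # Workers that are slow relative to the fastest get a larger bound,
        # so the fast workers are not forced to wait for every straggler.
        times = [t for t in self.iter_time if t is not None]
        if not times or self.iter_time[worker] is None:
            return self.base_staleness
        ratio = self.iter_time[worker] / min(times)
        return min(self.max_staleness, max(self.base_staleness, round(ratio)))

    def push_gradient(self, worker, grad, elapsed, lr=0.1):
        with self.lock:
            # Performance monitoring: exponentially weighted moving average
            # of this worker's iteration time.
            prev = self.iter_time[worker]
            self.iter_time[worker] = elapsed if prev is None else 0.9 * prev + 0.1 * elapsed
            for i, g in enumerate(grad):
                self.params[i] -= lr * g
            self.clock[worker] += 1
            self.lock.notify_all()

    def pull_params(self, worker):
        with self.lock:
            # Block while this worker is too far ahead of the slowest clock,
            # where "too far" is its adaptively chosen staleness bound.
            while self.clock[worker] - min(self.clock) > self._staleness_for(worker):
                self.lock.wait()
            return list(self.params)

def worker(ps, wid, steps, slowdown):
    for _ in range(steps):
        w = ps.pull_params(wid)
        start = time.time()
        time.sleep(slowdown * random.uniform(0.5, 1.5))  # simulate compute
        grad = [2 * (x - 1.0) for x in w]                # grad of sum((x-1)^2)
        ps.push_gradient(wid, grad, time.time() - start)

if __name__ == "__main__":
    ps = AdaptivePS(dim=4, num_workers=3)
    speeds = [0.01, 0.01, 0.05]  # third worker is a straggler
    threads = [threading.Thread(target=worker, args=(ps, i, 30, s))
               for i, s in enumerate(speeds)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("params ->", ps.params)  # should approach [1, 1, 1, 1]

In this sketch the straggler is simply allowed more staleness rather than blocking the fast workers at every step, which mirrors the abstract's claim of adjusting each node's synchronization with the parameter server according to its measured performance.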
Pages: 19222-19230
Number of pages: 9