Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Cited by: 47
Authors
Zhao, Xing [1 ]
An, Aijun [1 ]
Liu, Junfeng [2 ]
Chen, Bao Xin [1 ]
Affiliations
[1] York Univ, Dept Elect Engn & Comp Sci, Toronto, ON, Canada
[2] IBM Canada, Platform Comp, Markham, ON, Canada
Source
2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019) | 2019
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
distributed deep learning; parameter server; BSP; ASP; SSP; GPU cluster;
DOI
10.1109/ICDCS.2019.00150
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep learning is a popular machine learning technique that has been applied to many real-world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data, and it has become difficult for a single machine to train a large model over a large dataset. A popular solution is to distribute and parallelize training across multiple machines using the parameter server framework. In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves on the state-of-the-art Stale Synchronous Parallel (SSP) paradigm by determining the staleness threshold dynamically at run time. Conventionally, to run distributed training under SSP, the user must specify a particular staleness threshold as a hyper-parameter. However, users usually do not know how to set this threshold and often find a value through trial and error, which is time-consuming. Based on workers' recent processing times, DSSP adaptively adjusts the threshold at each iteration to reduce the time faster workers spend waiting to synchronize the globally shared parameters (the model weights). This increases the frequency of parameter updates (i.e., iteration throughput) and thereby speeds up convergence. We compare DSSP with other paradigms, namely Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and SSP, by running deep neural network (DNN) models on GPU clusters in both homogeneous and heterogeneous environments. The results show that in a heterogeneous environment, where the cluster consists of mixed GPU models, DSSP converges to a higher accuracy much earlier than SSP and BSP and performs similarly to ASP. In a homogeneous cluster, DSSP is more stable and performs slightly better than SSP and ASP, and converges much faster than BSP.
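The abstract describes the core idea only at a high level: instead of SSP's fixed staleness bound, the parameter server widens or narrows the bound per iteration based on how long each worker's recent iterations have taken. The Python sketch below is an illustrative reconstruction under stated assumptions, not the authors' implementation; the class and method names (DynamicStalenessServer, report_iteration, allowed_to_proceed) and the parameters s_min, s_max, and history are hypothetical, and the heuristic for widening the bound is a simple stand-in for whatever decision rule DSSP actually uses.

```python
# Illustrative sketch only (not from the paper): a toy parameter-server clock
# that gates workers by a staleness bound widened at run time from recent
# per-iteration processing times. All names and the widening heuristic are
# assumptions; the real DSSP decision rule may differ.
from collections import deque


class DynamicStalenessServer:
    def __init__(self, num_workers, s_min=2, s_max=8, history=5):
        self.clock = {w: 0 for w in range(num_workers)}   # per-worker iteration counts
        self.iter_time = {w: deque(maxlen=history)        # recent iteration durations (s)
                          for w in range(num_workers)}
        self.s_min, self.s_max = s_min, s_max

    def report_iteration(self, worker, seconds):
        """Record that `worker` finished one iteration taking `seconds`."""
        self.clock[worker] += 1
        self.iter_time[worker].append(seconds)

    def _avg_time(self, worker):
        t = self.iter_time[worker]
        return sum(t) / len(t) if t else 0.0

    def allowed_to_proceed(self, worker):
        """Decide whether `worker` may start its next iteration.

        SSP would compare the worker's lead over the slowest worker against a
        fixed threshold. Here the effective threshold grows beyond s_min when
        the slowest worker's iterations are much longer than this worker's,
        so a fast worker keeps computing instead of blocking; it is capped at
        s_max to bound gradient staleness.
        """
        lead = self.clock[worker] - min(self.clock.values())
        my_t = self._avg_time(worker) or 1e-9
        slowest_t = max(self._avg_time(w) for w in self.clock) or my_t
        extra_iters = max(0, int(slowest_t / my_t) - 1)
        threshold = min(self.s_max, self.s_min + extra_iters)
        return lead <= threshold
```

For example, with one slow worker averaging 2.0 s per iteration and two fast workers at 0.5 s, a fast worker four iterations ahead would still be allowed to proceed in this sketch (effective threshold 2 + 3 = 5), whereas a fixed SSP threshold of 2 would make it block and wait.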
Pages: 1507-1517
Page count: 11