Scalable Distributed DNN Training Using Commodity GPU Cloud Computing

Cited by: 0
Authors
Strom, Nikko [1 ]
Affiliations
[1] Amazon Com, Seattle, WA 98109 USA
Source
16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5 | 2015
Keywords
Speech recognition; deep neural networks; distributed stochastic gradient descent
DOI
Not available
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises for data-parallel SGD because compute nodes frequently need to synchronize a replica of the model. We solve it by purposefully controlling the rate of weight-update per individual weight, in contrast to the uniform update-rate customarily imposed by the size of a mini-batch. It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling. This reduction in communication bandwidth enables efficient scaling to more parallel GPU nodes than any other method that we are aware of, with no loss in convergence rate or accuracy in the resulting DNN. Furthermore, the training can be performed on commodity cloud infrastructure and networking.
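The key idea in the abstract is to decouple each weight's communication rate from the mini-batch size: a worker only transmits an update for a weight once its locally accumulated gradient exceeds a threshold, keeping the remainder as a residual for later steps. Below is a minimal NumPy sketch of such a thresholded, sparse update selection; the function name, the threshold value `tau`, and the array sizes are illustrative assumptions, not the paper's exact implementation or message encoding.

```python
import numpy as np

def sparse_threshold_update(local_grad, residual, tau=1.0):
    """Illustrative per-weight thresholded update (sketch, not the paper's code).

    The worker accumulates gradients in `residual`. Only elements whose
    accumulated magnitude reaches `tau` are packed into a sparse message
    (index, +/- tau); the transmitted amount is subtracted from the
    residual, so the untransmitted part is delayed rather than lost.
    """
    residual += local_grad                      # accumulate this step's gradient
    send_mask = np.abs(residual) >= tau         # weights whose update is "ready" to send
    indices = np.nonzero(send_mask)[0]
    values = np.sign(residual[indices]) * tau   # quantized update per selected weight
    residual[indices] -= values                 # keep the remainder for later steps
    return indices, values, residual

# Toy usage: a 10-dimensional "model", one simulated gradient step.
rng = np.random.default_rng(0)
residual = np.zeros(10)
grad = rng.normal(scale=0.8, size=10)
idx, vals, residual = sparse_threshold_update(grad, residual, tau=1.0)
print("sent", len(idx), "of 10 weight updates:", list(zip(idx.tolist(), vals.tolist())))
```

Because only the indices and signs of the selected weights need to be exchanged, the per-step message size scales with the number of threshold-crossing weights rather than the full model size, which is the source of the bandwidth reduction described above.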
Pages: 1488-1492
Number of pages: 5