Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Cited: 19
Authors
Chen, Chen [1 ,3 ]
Weng, Qizhen [1 ]
Wang, Wei [1 ]
Li, Baochun [2 ]
Li, Bo [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Toronto, Toronto, ON, Canada
[3] Huawei Future Network Theory Lab, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18) | 2018
Keywords
Distributed deep learning; load balancing; batch size
DOI
10.1145/3267809.3275463
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
In heterogeneous or shared clusters, distributed learning is slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for setting the workers' batch sizes is knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network that accounts for both historical speeds and driving factors such as CPU and memory usage when predicting each worker's speed.
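The abstract describes the mechanism only at a high level. Below is a minimal sketch of the two steps it names, speed prediction from recent history plus CPU/memory signals, followed by proportional batch-size assignment. The function names (assign_batch_sizes, narx_predict) are hypothetical, and the linear least-squares predictor is a simplified stand-in for the paper's neural NARX model, not its actual implementation.

```python
import numpy as np


def assign_batch_sizes(pred_speeds, global_batch):
    """Split a fixed global batch across workers in proportion to their
    predicted speeds (samples/sec), so every worker finishes an iteration
    at roughly the same time. Hypothetical helper, not the paper's code."""
    speeds = np.asarray(pred_speeds, dtype=float)
    shares = speeds / speeds.sum()
    sizes = np.floor(shares * global_batch).astype(int)
    # Hand the rounding remainder to the fastest workers.
    remainder = global_batch - int(sizes.sum())
    for i in np.argsort(-speeds)[:remainder]:
        sizes[i] += 1
    return sizes


def narx_predict(speed_history, exog_history, exog_next, lag=3):
    """Rough linear stand-in for the NARX predictor: regress the next speed
    on the last `lag` speeds plus exogenous factors (e.g. CPU, memory) via
    least squares. The paper uses a neural NARX model; this only shows
    which inputs the prediction consumes."""
    y = np.asarray(speed_history, dtype=float)
    x = np.asarray(exog_history, dtype=float)  # shape (T, num_factors)
    rows = [np.concatenate([y[t - lag:t], x[t]]) for t in range(lag, len(y))]
    targets = y[lag:]
    coef, *_ = np.linalg.lstsq(np.asarray(rows), targets, rcond=None)
    feature = np.concatenate([y[-lag:], np.asarray(exog_next, dtype=float)])
    return float(feature @ coef)


if __name__ == "__main__":
    # Three workers with different predicted speeds (samples/sec).
    speeds = [520.0, 310.0, 190.0]
    print(assign_batch_sizes(speeds, global_batch=1024))  # -> [523 311 190]
```

The proportional split keeps each worker's per-iteration compute time roughly equal, which is the load-balancing intuition behind LB-BSP; the real system predicts speeds with a neural NARX model rather than the least-squares fit shown here.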
Pages: 521 / 521
Page count: 1