Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Cited: 19
Authors
Chen, Chen [1 ,3 ]
Weng, Qizhen [1 ]
Wang, Wei [1 ]
Li, Baochun [2 ]
Li, Bo [1 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Toronto, Toronto, ON, Canada
[3] Huawei Future Network Theory Lab, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18) | 2018
Keywords
Distributed deep learning; load balancing; batch size
DOI
10.1145/3267809.3275463
Chinese Library Classification (CLC)
TP [Automation technology; computer technology]
Discipline code
0812
Abstract
In heterogeneous or shared clusters, distributed learning is slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for setting the workers' batch sizes is knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network that accounts for both historical speeds and driving factors such as CPU and memory usage when predicting each worker's speed.
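The abstract describes the mechanism only at a high level. Below is a minimal sketch of the two steps it names, speed prediction from recent history plus CPU/memory signals, followed by proportional batch-size assignment. The function names (assign_batch_sizes, narx_predict) are hypothetical, and the linear least-squares predictor is a simplified stand-in for the paper's neural NARX model, not its actual implementation.

```python
import numpy as np


def assign_batch_sizes(pred_speeds, global_batch):
    """Split a fixed global batch across workers in proportion to their
    predicted speeds (samples/sec), so every worker finishes an iteration
    at roughly the same time. Hypothetical helper, not the paper's code."""
    speeds = np.asarray(pred_speeds, dtype=float)
    shares = speeds / speeds.sum()
    sizes = np.floor(shares * global_batch).astype(int)
    # Hand the rounding remainder to the fastest workers.
    remainder = global_batch - int(sizes.sum())
    for i in np.argsort(-speeds)[:remainder]:
        sizes[i] += 1
    return sizes


def narx_predict(speed_history, exog_history, exog_next, lag=3):
    """Rough linear stand-in for the NARX predictor: regress the next speed
    on the last `lag` speeds plus exogenous factors (e.g. CPU, memory) via
    least squares. The paper uses a neural NARX model; this only shows
    which inputs the prediction consumes."""
    y = np.asarray(speed_history, dtype=float)
    x = np.asarray(exog_history, dtype=float)  # shape (T, num_factors)
    rows = [np.concatenate([y[t - lag:t], x[t]]) for t in range(lag, len(y))]
    targets = y[lag:]
    coef, *_ = np.linalg.lstsq(np.asarray(rows), targets, rcond=None)
    feature = np.concatenate([y[-lag:], np.asarray(exog_next, dtype=float)])
    return float(feature @ coef)


if __name__ == "__main__":
    # Three workers with different predicted speeds (samples/sec).
    speeds = [520.0, 310.0, 190.0]
    print(assign_batch_sizes(speeds, global_batch=1024))  # -> [523 311 190]
```

The proportional split keeps each worker's per-iteration compute time roughly equal, which is the load-balancing intuition behind LB-BSP; the real system predicts speeds with a neural NARX model rather than the least-squares fit shown here.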
Pages: 521 / 521
Page count: 1