Fast Distributed Deep Learning via Worker-adaptive Batch Sizing
Cited by: 19
Authors:
Chen, Chen [1,3]; Weng, Qizhen [1]; Wang, Wei [1]; Li, Baochun [2]; Li, Bo [1]
Affiliations:
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Toronto, Toronto, ON, Canada
[3] Huawei Future Network Theory Lab, Hong Kong, Peoples R China
Source:
PROCEEDINGS OF THE 2018 ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC '18) | 2018
Keywords:
Distributed deep learning; load balancing; batch size
DOI:
10.1145/3267809.3275463
Chinese Library Classification (CLC): TP [Automation Technology, Computer Technology]
Subject Classification Code: 0812
Abstract:
In heterogeneous or shared clusters, distributed learning processes are slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for deciding the workers' batch sizes is knowing their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network that accounts for both historical speeds and driving factors such as CPU and memory usage when making predictions.
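The core mechanism the abstract describes, sizing each worker's batch in proportion to its predicted speed so that all workers finish an iteration at roughly the same time, can be sketched in a few lines. The snippet below is a minimal illustration under assumed names (assign_batch_sizes, largest-remainder rounding), not the paper's implementation; in LB-BSP the speeds would come from the NARX predictor, whereas here they are supplied directly.

from typing import List

def assign_batch_sizes(global_batch: int, predicted_speeds: List[float]) -> List[int]:
    """Split a fixed global batch across workers in proportion to their
    predicted processing speeds (samples/sec), so that the per-iteration
    compute time batch_i / speed_i is roughly equal for all workers."""
    total_speed = sum(predicted_speeds)
    # Ideal (fractional) share for each worker.
    shares = [global_batch * s / total_speed for s in predicted_speeds]
    # Round down, then give the leftover samples to the workers with the
    # largest fractional remainders (largest-remainder rounding).
    sizes = [int(share) for share in shares]
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# Example: a 256-sample global batch over three workers with predicted speeds
# of 120, 60 and 30 samples/sec; the slowest worker gets the smallest batch,
# so all three finish the iteration at about the same time.
print(assign_batch_sizes(256, [120.0, 60.0, 30.0]))  # [146, 73, 37]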
Pages: 521-521
Number of pages: 1