Efficient Mini-batch Training for Stochastic Optimization

Cited by: 538
Authors
Li, Mu [1 ,2 ]
Zhang, Tong [2 ,3 ]
Chen, Yuqiang [2 ]
Smola, Alexander J. [1 ,4 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Baidu Inc, Beijing, Peoples R China
[3] Rutgers State Univ, New Brunswick, NJ USA
[4] Google Inc, Mountain View, CA USA
Source
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14) | 2014
Keywords
DOI
10.1145/2623330.2623612
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
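The idea in the abstract can be sketched as follows: instead of taking a single gradient step per minibatch, approximately minimize the minibatch loss plus a conservative (proximal) penalty that keeps the new iterate close to the previous one. This is a minimal illustration, not the paper's implementation; the function name, the choice of logistic loss, the inner gradient-descent solver, and all hyperparameters (`lr`, `prox`, `inner_steps`) are assumptions made for the example.

```python
import numpy as np

def minibatch_update_conservative(w, X, y, lr=0.1, prox=1.0, inner_steps=5):
    """One conservatively regularized minibatch update (illustrative sketch).

    Approximately solves
        min_w  minibatch_loss(w) + (prox / 2) * ||w - w_t||^2
    by running a few gradient steps, where w_t is the iterate before this
    minibatch. Uses logistic loss on the minibatch (X, y) as the example loss.
    """
    w_t = w.copy()                        # anchor point for the proximal term
    for _ in range(inner_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid predictions
        grad = X.T @ (p - y) / len(y)     # minibatch logistic-loss gradient
        grad += prox * (w - w_t)          # gradient of the conservative penalty
        w = w - lr * grad
    return w
```

Setting `inner_steps=1` and `prox=0.0` recovers a plain minibatch SGD step, so the proximal term and the extra inner iterations are exactly what distinguishes this update from the standard one.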
Pages: 661-670
Page count: 10