Efficient Mini-batch Training for Stochastic Optimization

Cited by: 538
Authors
Li, Mu [1 ,2 ]
Zhang, Tong [2 ,3 ]
Chen, Yuqiang [2 ]
Smola, Alexander J. [1 ,4 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Baidu Inc, Beijing, Peoples R China
[3] Rutgers State Univ, New Brunswick, NJ USA
[4] Google Inc, Mountain View, CA USA
Source
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14) | 2014
Keywords
DOI
10.1145/2623330.2623612
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Stochastic gradient descent (SGD) is a popular technique for large-scale optimization problems in machine learning. In order to parallelize SGD, minibatch training needs to be employed to reduce the communication cost. However, an increase in minibatch size typically decreases the rate of convergence. This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. We prove that the convergence rate does not decrease with increasing minibatch size. Experiments demonstrate that with suitable implementations of approximate optimization, the resulting algorithm can outperform standard SGD in many scenarios.
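The idea in the abstract can be sketched as follows: instead of taking a single gradient step per minibatch, approximately minimize the minibatch loss plus a conservative (proximal) penalty that keeps the new iterate close to the previous one. This is a minimal illustration, not the paper's implementation; the function name, the choice of logistic loss, the inner gradient-descent solver, and all hyperparameters (`lr`, `prox`, `inner_steps`) are assumptions made for the example.

```python
import numpy as np

def minibatch_update_conservative(w, X, y, lr=0.1, prox=1.0, inner_steps=5):
    """One conservatively regularized minibatch update (illustrative sketch).

    Approximately solves
        min_w  minibatch_loss(w) + (prox / 2) * ||w - w_t||^2
    by running a few gradient steps, where w_t is the iterate before this
    minibatch. Uses logistic loss on the minibatch (X, y) as the example loss.
    """
    w_t = w.copy()                        # anchor point for the proximal term
    for _ in range(inner_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid predictions
        grad = X.T @ (p - y) / len(y)     # minibatch logistic-loss gradient
        grad += prox * (w - w_t)          # gradient of the conservative penalty
        w = w - lr * grad
    return w
```

Setting `inner_steps=1` and `prox=0.0` recovers a plain minibatch SGD step, so the proximal term and the extra inner iterations are exactly what distinguishes this update from the standard one.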
Pages: 661-670
Page count: 10