Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

Cited by: 8
Authors
Zhang, Zhe [1 ]
Wu, Chuan [1 ]
Li, Zongpeng [2 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Wuhan Univ, Wuhan, Peoples R China
Source
IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2021) | 2021
Keywords
APPROXIMATION;
DOI
10.1109/INFOCOM42981.2021.9488678
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
Distributed machine learning with multiple concurrent workers has been widely adopted to train large deep neural networks (DNNs). Parameter synchronization is a key component of each training iteration, in which workers exchange locally computed gradients through an AllReduce operation or parameter servers (PS) for global parameter updates. Parameter synchronization often constitutes a significant portion of the training time, so minimizing communication time contributes substantially to DNN training speed-up. Standard ring-based AllReduce and the PS architecture work efficiently mostly under homogeneous inter-worker connectivity. However, available bandwidth among workers in real-world clusters is often heterogeneous, due to differing hardware configurations, switching topologies, and contention with concurrent jobs. This work investigates the best parameter synchronization topology and schedule among workers for the most expedited communication in distributed DNN training. We show that the optimal parameter synchronization topology should consist of trees rooted at different workers, each aggregating or broadcasting one partition of the gradients/parameters. We identify a near-optimal forest packing that maximally utilizes available bandwidth, and overlap the aggregation and broadcast stages to minimize communication time. We provide a theoretical analysis of the performance bound and show, through extensive evaluation under various settings, that our scheme outperforms state-of-the-art parameter synchronization schemes by up to 18.3 times.
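For illustration only, the minimal Python/NumPy sketch below simulates the aggregate-then-broadcast structure described in the abstract: the gradient is partitioned into one chunk per worker, chunk r is summed bottom-up along a tree rooted at worker r, and the reduced chunk is then broadcast back so every worker reassembles the full aggregated gradient. The function names, the star-shaped placeholder trees, and the plain-Python simulation are assumptions of this sketch; it does not reproduce the paper's forest-packing algorithm, bandwidth model, or stage overlapping.

# Hypothetical sketch (not the authors' implementation): simulate tree-based
# parameter synchronization. Each worker's gradient is split into one chunk
# per worker; chunk r is summed bottom-up along a spanning tree rooted at
# worker r, then the reduced chunk is broadcast back to all workers.
from collections import defaultdict

import numpy as np


def invert_tree(parent_of):
    """Turn a {child: parent} map into a {node: [children]} adjacency list."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)
    return children


def aggregate_chunk(root, parent_of, chunk_at_worker):
    """Sum one gradient chunk bottom-up along the tree rooted at `root`."""
    children = invert_tree(parent_of)

    def subtree_sum(node):
        total = chunk_at_worker[node].copy()
        for child in children[node]:
            total += subtree_sum(child)  # child forwards its partial sum upward
        return total

    return subtree_sum(root)


def forest_allreduce(local_grads, trees):
    """Synchronize gradients using one tree per chunk (one root per worker)."""
    num_workers = len(local_grads)
    parts = [np.array_split(g, num_workers) for g in local_grads]

    # Aggregation stage: chunk r is reduced along the tree rooted at worker r.
    reduced = {
        root: aggregate_chunk(root, parent_of,
                              [parts[w][root] for w in range(num_workers)])
        for root, parent_of in trees.items()
    }

    # Broadcast stage: each root sends its reduced chunk back down its tree;
    # here every worker simply reassembles the full aggregated gradient.
    full = np.concatenate([reduced[r] for r in range(num_workers)])
    return [full.copy() for _ in range(num_workers)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_workers, grad_len = 4, 12
    local = [rng.standard_normal(grad_len) for _ in range(num_workers)]

    # Placeholder topology: a depth-1 (star) tree rooted at each worker.
    # Real forest packing would choose tree edges based on link bandwidths.
    trees = {r: {w: r for w in range(num_workers) if w != r}
             for r in range(num_workers)}

    synced = forest_allreduce(local, trees)
    assert np.allclose(synced[0], sum(local))  # every worker holds the global sum

In a real deployment, the tree edges would be chosen according to the heterogeneous link bandwidths and the aggregation and broadcast stages pipelined, which is precisely the packing and overlapping problem the paper addresses.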
Pages: 10
References: 30 in total