Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning

Cited by: 9
Authors
Chu, Ching-Hsiang [1 ]
Lu, Xiaoyi [1 ]
Awan, Ammar A. [1 ]
Subramoni, Hari [1 ]
Hashmi, Jahanzeb [1 ]
Elton, Bracy [2 ]
Panda, Dhabaleswar K. [1 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Engility Corp, Wright Patterson AFB, OH USA
Source
2017 46TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP) | 2017
Keywords
Deep Learning; InfiniBand; MPI; Hardware Multicast; Multi-source Broadcast; GPU; GPUDirect RDMA; Streaming;
DOI
10.1109/ICPP.2017.25
CLC Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Broadcast operations (e.g., MPI_Bcast) are widely used in deep learning applications to exchange large amounts of data among multiple graphics processing units (GPUs). Recent studies have shown that leveraging the InfiniBand hardware-based multicast (IB-MCAST) protocol can enhance the scalability of GPU-based broadcast operations. However, these initial IB-MCAST designs are not optimized for multi-source broadcast operations with large messages, which is the common communication scenario in deep learning applications. In this paper, we first model existing broadcast schemes and analyze their performance bottlenecks on GPU clusters. We then propose a novel broadcast design based on message streaming that better exploits IB-MCAST and NVIDIA GPUDirect RDMA (GDR) technology for efficient large-message transfers. The proposed design provides high overlap among multi-source broadcast operations. Experimental results show up to a 68% reduction in latency compared to state-of-the-art solutions in a benchmark-level evaluation, and the proposed design exhibits near-constant latency for a single broadcast operation as the system size grows. Furthermore, it yields up to 24% performance improvement in the popular deep learning framework Microsoft CNTK, which uses multi-source broadcast operations; notably, these gains are achieved without modifications to applications. Our model validation shows that the proposed analytical model and the experimental results agree within 10%. The model also predicts that the proposed design outperforms existing schemes for multi-source broadcast scenarios as the number of broadcast sources increases on large-scale GPU clusters.
Pages: 161 - 170
Page count: 10