The Unicorn Runtime: Efficient Distributed Shared Memory Programming for Hybrid CPU-GPU Clusters

Cited by: 11
Authors
Beri, Tarun [1 ]
Bansal, Sorav [1 ]
Kumar, Subodh [1 ]
Affiliations
[1] Indian Inst Technol, Dept Comp Sci & Engn, New Delhi 110016, India
Keywords
Unicorn runtime; distributed system design; scheduling; load balancing; accelerators; bulk synchronous parallelism
DOI
10.1109/TPDS.2016.2616314
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Programming hybrid CPU-GPU clusters is hard. This paper addresses this difficulty and presents the design and runtime implementation of Unicorn, a parallel programming model for hybrid CPU-GPU clusters. In particular, this paper demonstrates that efficient distributed shared memory style programming is possible and that its simplicity can be retained across CPUs and GPUs in a cluster, minus the frustration of dealing with race conditions. Further, this can be done with a unified abstraction, avoiding much of the complication of dealing with hybrid architectures. This is achieved with the help of transactional semantics (on shared global address spaces), deferred bulk data synchronization, workload pipelining, and various communication and computation scheduling optimizations. We describe the said abstraction and our computation and communication scheduling system, and report its performance on a few benchmarks such as matrix multiplication, LU decomposition, and 2D FFT. We find that parallelizing coarse-grained applications like matrix multiplication or 2D FFT with our system requires only about 30 lines of C code to set up the runtime; the rest of the application code is a regular single-CPU/GPU implementation. This indicates the ease of extending parallel code to a distributed environment. The execution is efficient as well. When multiplying two square matrices of size 65,536 x 65,536, Unicorn achieves a peak performance of 7.88 TFlop/s on a cluster of 14 nodes, each equipped with two Tesla M2070 GPUs and two 6-core Intel Xeon 2.67 GHz CPUs, connected over a 32 Gbps InfiniBand network. In this paper, we also demonstrate that the Unicorn programming model can be efficiently used to implement high-level abstractions like MapReduce. We use such an extension to implement PageRank and report its performance. For a sample web of 500 million pages, our implementation completes a PageRank iteration in about 18 seconds (on average) on a 14-node cluster.
Pages: 1518-1534 (17 pages)