Effective Multi-GPU Communication Using Multiple CUDA Streams and Threads

Cited by: 0
Authors
Sourouri, Mohammed [1, 2]
Gillberg, Tor [1]
Baden, Scott B. [3]
Cai, Xing [1, 2]
Affiliations
[1] Simula Res Lab, POB 134, N-1325 Lysaker, Norway
[2] Univ Oslo, Dept Informat, N-0316 Oslo, Norway
[3] Univ Calif San Diego, La Jolla, CA 92093 USA
Source
2014 20TH IEEE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2014
Funding
US National Science Foundation;
Keywords
GPU; CUDA; OpenMP; MPI; overlap communication with computation; multi-GPU;
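DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract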
In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare the performance of our new scheme with an MPI-based state-of-the-art scheme. Results show that our approach outperforms the state-of-the-art scheme, being up to 1.85x faster. However, our performance results also indicate that the current underlying PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
Pages: 981-986
Number of pages: 6
Related papers
12 in total
  • [1] [Anonymous], P INT C HIGH PERF CO
  • [2] Benchmarking of communication techniques for GPUs
    Bernaschi, M.
    Bisson, M.
    Rossetti, D.
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (02): 250-255
  • [3] Jacobsen D., 2010, P 48 AIAA AER SCI M, DOI 10.2514/6.2010-522
  • [4] Maruyama N., 2011, P 2011 INT C HIGH PE
  • [5] Message Passing Interface Forum, 2012, MPI MESS PAS INT STA
  • [6] Micikevicius P., 2009, Proceedings of Second Workshop on General Purpose Processing on Graphics Processing Units, V383, P79, DOI 10.1145/1513895.1513905
  • [7] Micikevicius Paulius., MULTIGPU PROGRAMMING
  • [8] OpenMP Architecture Review Board, 2013, OpenMP Application Program Interface
  • [9] Phillips E. H., 2010, Implementing the Himeno benchmark with CUDA on GPU clusters (IPDPS '10), P1
  • [10] Playne D. P., 2011, INT C PAR DISTR PROC, P169