BigKernel - High Performance CPU-GPU Communication Pipelining for Big Data-style Applications

Cited by: 13
Authors
Mokhtari, Reza [1 ]
Stumm, Michael [1 ]
Affiliations
[1] Univ Toronto, Dept Elect & Comp Engn, Toronto, ON, Canada
Source
2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM | 2014
Keywords
GPU; CPU; communication; management; optimization; stream processing; framework
DOI
10.1109/IPDPS.2014.89
Chinese Library Classification (CLC) Number
TP3 [Computing technology, computer technology]
Subject Classification Number
0812
Abstract
GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways; e.g., for transformations, filtering, aggregation, partitioning or other "Big Data" style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs that efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose BigKernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. BigKernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments where each segment is operated on independently; these kernels are transformed into BigKernels using straightforward compiler transformations. Our evaluation on six data-intensive benchmarks shows that BigKernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.
Pages: 10
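
For context, the double-buffering baseline the abstract compares against overlaps CPU-GPU transfers with kernel execution on independent data segments. Below is a minimal CUDA sketch of that pattern, assuming a hypothetical per-element process_segment kernel and illustrative buffer sizes; it is not the paper's implementation, which additionally automates prefetching and optimizes GPU memory accesses across its 4-stage pipeline.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element transform; BigKernel targets kernels whose input
// can be partitioned into segments processed independently, like this one.
__global__ void process_segment(const float *in, float *out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f + 1.0f;  // placeholder "Big Data" transform
}

int main() {
    const size_t total = 1 << 24;  // full data set (larger than we want resident on the GPU)
    const size_t seg   = 1 << 20;  // segment streamed per pipeline step
    float *h_in, *h_out;
    cudaMallocHost(&h_in,  total * sizeof(float));  // pinned host memory enables async copies
    cudaMallocHost(&h_out, total * sizeof(float));
    for (size_t i = 0; i < total; i++) h_in[i] = (float)i;

    // Two device buffers and two streams: while segment k computes in one
    // stream, segment k+1 transfers in the other (classic double buffering).
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&d_in[b],  seg * sizeof(float));
        cudaMalloc(&d_out[b], seg * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    for (size_t off = 0, k = 0; off < total; off += seg, k++) {
        int b = (int)(k & 1);  // alternate between the two buffer/stream pairs
        size_t n = (total - off < seg) ? (total - off) : seg;
        cudaMemcpyAsync(d_in[b], h_in + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process_segment<<<(unsigned)((n + 255) / 256), 256, 0, s[b]>>>(
            d_in[b], d_out[b], n);
        cudaMemcpyAsync(h_out + off, d_out[b], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    printf("h_out[1] = %.1f\n", h_out[1]);  // expect 3.0

    for (int b = 0; b < 2; b++) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

Note that stream ordering guarantees correctness here: reusing buffer b for segment k+2 is safe because segment k's device-to-host copy was issued earlier in the same stream. The manual bookkeeping this sketch requires (pinned buffers, segment offsets, stream juggling) is exactly what the paper's compiler transformations aim to hide behind a simple kernel written against the full data structure.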