CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Cited by: 11
Authors
Yang, Yi [1 ]
Li, Chao [2 ]
Zhou, Huiyang [2 ]
Affiliations
[1] NEC Labs Amer, Dept Comp Syst Architecture, Princeton, NJ 08540 USA
[2] N Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
US National Science Foundation;
Keywords
GPGPU; nested parallelism; compiler; local memory; OpenMP; performance; optimization; framework; design;
DOI
10.1007/s11390-015-1500-y
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is common for a thread in a parallel program, such as a thread in a CUDA GPU kernel, to contain both sequential code and parallel loops. To leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and show that these loops do not have high trip counts or a high degree of TLP; consequently, the benefit of exploiting them with dynamic parallelism is too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we launch a large number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement the CUDA-NP framework using a directive-based compiler approach: for a GPU kernel, an application developer only needs to add OpenMP-like pragmas to parallelizable code sections, and the CUDA-NP compiler automatically generates the optimized GPU kernels. The compiler supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations among threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, CUDA-NP further improves performance by up to 6.69 times, and by 2.01 times on average.
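The mechanism the abstract describes, over-provisioning threads at launch and gating code sections with control flow, can be illustrated with a short CUDA example. This is a minimal sketch of the idea only: NP_FACTOR, the helper sequential_part, both kernel bodies, and the fixed block size of 256 are hypothetical, and the real CUDA-NP compiler output additionally handles reductions, scans, and on-chip resource management.

    #include <cuda_runtime.h>

    #define NP_FACTOR 8   // hypothetical: auxiliary threads per original thread

    // Hypothetical stand-in for a low-TLP, sequential code section.
    __device__ float sequential_part(const float *data, int tid) {
        return data[tid] * 0.5f;
    }

    // Original-style kernel: each thread runs sequential code and then a
    // parallel loop. The commented pragma mimics the OpenMP-like directive
    // a developer would add for CUDA-NP.
    __global__ void kernel_orig(float *data, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = sequential_part(data, tid);
        // #pragma np parallel for
        for (int i = 0; i < n; ++i)
            data[tid * n + i] += acc;
    }

    // Transformed-style kernel, launched with NP_FACTOR times as many
    // threads: control flow activates only one "master" thread per group
    // for the sequential section, while the whole group shares the loop.
    __global__ void kernel_np(float *data, int n) {
        int gid  = (blockIdx.x * blockDim.x + threadIdx.x) / NP_FACTOR;
        int lane = threadIdx.x % NP_FACTOR;        // position within the group
        int grp  = threadIdx.x / NP_FACTOR;
        __shared__ float acc_s[256 / NP_FACTOR];   // assumes blockDim.x == 256

        if (lane == 0)                             // master thread only
            acc_s[grp] = sequential_part(data, gid);
        __syncthreads();                           // publish result to the group

        float acc = acc_s[grp];
        for (int i = lane; i < n; i += NP_FACTOR)  // cyclic iteration distribution
            data[gid * n + i] += acc;
    }

Because the handoff from the sequential section to the loop goes through shared memory, this layout avoids the global-memory communication and kernel-launch overhead that a dynamic-parallelism version (a parent thread launching a child kernel) would incur.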
Pages: 3-19 (17 pages)