Morph Algorithms on GPUs

被引:54
|
作者
Nasre, Rupesh [1 ]
Burtscher, Martin [2 ]
Pingali, Keshav [1 ,3 ]
机构
[1] Univ Texas Austin, Inst Computat Engn & Sci, Austin, TX 78712 USA
[2] SW Texas State Univ, Dept Comp Sci, San Marcos, TX 78666 USA
[3] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
基金
美国国家科学基金会;
关键词
Algorithms; Languages; Performance; Morph Algorithms; Graph Algorithms; Irregular Programs; GPU; CUDA; Delaunay Mesh Refinement; Survey Propagation; Minimum Spanning Tree; Boruvka; Points-to Analysis; GRAPH ALGORITHMS; PARALLELISM; CUDA;
D O I
10.1145/2517327.2442531
中图分类号
TP31 [计算机软件];
学科分类号
081202 ; 0835 ;
摘要
There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs. In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-to Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges. We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs. This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.
引用
收藏
页码:147 / 156
页数:10
相关论文
共 50 条
  • [41] Classical Simulation of Quantum Adiabatic Algorithms Using Mathematica on GPUs
    Diaz-Pier, Sandra
    Venegas-Andraca, Salvador E.
    Luis Gomez-Munoz, Jose
    INTERNATIONAL JOURNAL OF UNCONVENTIONAL COMPUTING, 2011, 7 (05) : 315 - 330
  • [42] Pore Network Simulation via Monte Carlo Algorithms on GPUs
    Matadamas, J.
    Roman, G.
    Rojas, F.
    Castro, M. A.
    Cordero, S.
    Aguilar, M.
    IEEE LATIN AMERICA TRANSACTIONS, 2014, 12 (03) : 491 - 498
  • [43] Speeding up the evaluation phase of GP classification algorithms on GPUs
    Cano, Alberto
    Zafra, Amelia
    Ventura, Sebastian
    SOFT COMPUTING, 2012, 16 (02) : 187 - 202
  • [44] Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs
    Jorda, Marc
    Valero-Lara, Pedro
    Pena, Antonio J.
    IEEE ACCESS, 2019, 7 : 70461 - 70473
  • [45] Massive Atomics for Massive Parallelism on GPUs
    Egielski, Ian
    Huang, Jesse
    Zhang, Eddy Z.
    ACM SIGPLAN NOTICES, 2014, 49 (11) : 93 - 103
  • [46] HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs
    Almasri, Mohammad
    Vasudeva, Neo
    Nagi, Rakesh
    Xiong, Jinjun
    Hwu, Wen-Mei
    2021 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2021,
  • [47] Improving FCM and T2FCM Algorithms Performance using GPUs for Medical Images Segmentation
    Shehab, Mohammed A.
    Al-Ayyoub, Mahmoud
    Jararweh, Yaser
    2015 6TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION SYSTEMS (ICICS), 2015, : 130 - 135
  • [48] Leveraging GPUs Using Cooperative Loop Speculation
    Samadi, Mehrzad
    Hormati, Amir
    Lee, Janghaeng
    Mahlke, Scott
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2014, 11 (01)
  • [49] GKLEE: Concolic Verification and Test Generation for GPUs
    Li, Guodong
    Li, Peng
    Sawaya, Geof
    Gopalakrishnan, Ganesh
    Ghosh, Indradeep
    Rajan, Sreeranga P.
    ACM SIGPLAN NOTICES, 2012, 47 (08) : 215 - 224
  • [50] Towards accelerating irregular EDA applications with GPUs
    Qian, Hao
    Deng, Yangdong
    Wang, Bo
    Mu, Shuai
    INTEGRATION-THE VLSI JOURNAL, 2012, 45 (01) : 46 - 60