An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors

Cited by: 4
Authors
Ji, Hao [1]
Sosonkina, Masha [2]
Li, Yaohang [1]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Old Dominion Univ, Dept Modeling Simulat & Visualizat Engn, Norfolk, VA 23529 USA
Source
2014 HARDWARE-SOFTWARE CO-DESIGN FOR HIGH PERFORMANCE COMPUTING (CO-HPC) | 2014
Keywords
Block Conjugate Gradient; Multi-core CPU; Graphics Processing Unit; Intel Xeon Phi; Performance Evaluation;
DOI
10.1109/Co-HPC.2014.10
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of the various matrix operations in BCG, we identify the construction of new search direction matrices as the main performance bottleneck. Replacing the QR decomposition with the eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enable the BCG implementation to handle linear systems with large, sparse coefficient matrices that cannot fit in GPU memory. The hybrid scheme offloads matrix operations to the GPU while helping to hide the CPU-GPU memory transaction overhead. We compare the performance of our BCG implementation with that of a CPU implementation using Intel Xeon Phi coprocessors in automatic offload mode. With a sufficient number of right-hand sides, the CPU-GPU implementation of BCG can reach a speedup of 2.61 over the CPU-only implementation, which is significantly higher than that of the CPU-Intel Xeon Phi implementation.
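The abstract does not spell out how the eigendecomposition replaces the QR step. One common way to do this, shown here only as a minimal sketch and not as the authors' implementation, is to orthonormalize the tall n x s block of search directions through its small s x s Gram matrix. The function name `orthonormalize_via_gram`, the tolerance `rel_tol`, and the test sizes are hypothetical choices made for illustration.

```python
import numpy as np


def orthonormalize_via_gram(P, rel_tol=1e-12):
    """Return Q with orthonormal columns spanning the column space of P.

    Assumption (not taken from the paper): the eigendecomposition of the
    small s x s Gram matrix P^T P stands in for a QR factorization of the
    tall n x s block P.
    """
    G = P.T @ P                       # s x s, symmetric positive semidefinite
    eigvals, V = np.linalg.eigh(G)    # G = V diag(eigvals) V^T
    # Drop directions with negligible eigenvalues (rank-deficient block).
    keep = eigvals > rel_tol * eigvals.max()
    # Scale each retained eigenvector by 1/sqrt(eigenvalue), then map back.
    return P @ (V[:, keep] / np.sqrt(eigvals[keep]))


rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 8))    # n = 1000 unknowns, s = 8 right-hand sides
Q = orthonormalize_via_gram(P)
orth_error = np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))
```

The only large operations in this sketch are two tall-skinny matrix products, which tend to map well onto GPU BLAS routines, while the eigendecomposition itself acts on an s x s matrix. The trade-off is that forming the Gram matrix squares the condition number of P, so this route is less numerically robust than a Householder QR when the block is ill-conditioned.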
Pages: 72-77
Number of pages: 6
Related papers
39 in total
  • [31] Scalable and Efficient Spatial Data Management on Multi-Core CPU and GPU Clusters: A Preliminary Implementation based on Impala
    You, Simin
    Zhang, Jianting
    Gruenwald, Le
    2015 13TH IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING WORKSHOPS (ICDEW), 2015, : 143 - 148
  • [32] Parallel Implementation of FP Growth Algorithm on XML Data Using Multiple GPU
    Rathi, Sheetal
    Dhote, C. A.
    INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS, VOL 1, 2015, 339 : 581 - 589
  • [33] Optimized thread-block arrangement in a GPU implementation of a linear solver for atmospheric chemistry mechanisms
    Ruiz, Christian Guzman
    Acosta, Mario
    Jorba, Oriol
    Galobardes, Eduardo Cesar
    Dawson, Matthew
    Oyarzun, Guillermo
    Garcia-Pando, Carlos Perez
    Serradell, Kim
    COMPUTER PHYSICS COMMUNICATIONS, 2024, 302
  • [34] Optimized Implementation of PIPO Block Cipher on 32-Bit ARM and RISC-V Processors
    Kim, Youngbeom
    Seo, Seog Chung
    IEEE ACCESS, 2022, 10 : 97298 - 97309
  • [35] A cost-optimal parallel algorithm for the 0-1 knapsack problem and its performance on multicore CPU and GPU implementations
    Li, Kenli
    Liu, Jing
    Wan, Lanjun
    Yin, Shu
    Li, Keqin
    PARALLEL COMPUTING, 2015, 43 : 27 - 42
  • [36] A Parallel Genetic Algorithm With Dispersion Correction for HW/SW Partitioning on Multi-Core CPU and Many-Core GPU
    Hou, Neng
    He, Fazhi
    Zhou, Yi
    Chen, Yilin
    Yan, Xiaohu
    IEEE ACCESS, 2018, 6 : 883 - 898
  • [37] ICCM2016: The Implementation of Two-Dimensional Multi-Block Lattice Boltzmann Method on GPU
    Zhang, Ya
    Pan, Cuang
    Huang, Qiaogao
    INTERNATIONAL JOURNAL OF COMPUTATIONAL METHODS, 2019, 16 (05)
  • [38] Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging
    Jaros, Milan
    Strakos, Petr
    Karasek, Tomas
    Riha, Lubomir
    Vasatova, Alena
    Jarogova, Marta
    Kozubek, Tomas
    ADVANCES IN ENGINEERING SOFTWARE, 2017, 103 : 21 - 28
  • [39] Enhancement of membrane computing model implementation on GPU by introducing matrix representation for balancing occupancy and reducing inter-block communications
    Maroosi, Ali
    Muniyandi, Ravie Chandren
    JOURNAL OF COMPUTATIONAL SCIENCE, 2014, 5 (06) : 861 - 871