An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors

Cited by: 4
Authors
Ji, Hao [1]
Sosonkina, Masha [2]
Li, Yaohang [1]
Affiliations
[1] Old Dominion Univ, Dept Comp Sci, Norfolk, VA 23529 USA
[2] Old Dominion Univ, Dept Modeling Simulat & Visualizat Engn, Norfolk, VA 23529 USA
Source
2014 HARDWARE-SOFTWARE CO-DESIGN FOR HIGH PERFORMANCE COMPUTING (CO-HPC) | 2014
Keywords
Block Conjugate Gradient; Multi-core CPU; Graphics Processing Unit; Intel Xeon Phi; Performance Evaluation;
DOI
10.1109/Co-HPC.2014.10
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of the various matrix operations in BCG, we identify the construction of new search direction matrices as the main performance bottleneck. Replacing the QR decomposition with the eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enable the BCG implementation to handle linear systems with large, sparse coefficient matrices that cannot fit in GPU memory. The hybrid scheme offloads matrix operations to GPU processors while helping to hide the CPU-GPU memory transaction overhead. We compare the performance of our BCG implementation with that of a CPU implementation with Intel Xeon Phi coprocessors using the automatic offload mode. With a sufficient number of right-hand sides, the CPU-GPU implementation of BCG can reach a speedup of 2.61 over the CPU-only implementation, which is significantly higher than that of the CPU-Intel Xeon Phi implementation.
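The abstract does not spell out how the eigendecomposition replaces QR. One common way to orthonormalize a tall, skinny block of vectors through a small eigenproblem is the Gram-matrix approach sketched below. This is an illustrative assumption, not the authors' code: the function name `orthonormalize_via_gram` is hypothetical, the sketch uses the plain Euclidean inner product, and it assumes the block has full column rank.

```python
import numpy as np


def orthonormalize_via_gram(P):
    """Return Q with orthonormal columns spanning the same space as P.

    Hypothetical helper for illustration only.
    Assumes P is n x s with n >> s and full column rank.
    """
    G = P.T @ P                    # small s x s Gram matrix
    w, V = np.linalg.eigh(G)       # G = V diag(w) V^T; w > 0 when P has full rank
    return (P @ V) / np.sqrt(w)    # divide column j of P @ V by sqrt(w[j])


rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 4))    # tall, skinny block standing in for search directions
Q = orthonormalize_via_gram(P)
orth_error = np.linalg.norm(Q.T @ Q - np.eye(4))   # distance of Q^T Q from the identity
```

The expensive work here is two tall matrix products, which map well to GPUs, while the eigendecomposition itself only touches an s x s matrix. The trade-off is that forming the Gram matrix squares the condition number of the block, so this approach can lose accuracy when the columns are nearly linearly dependent.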
Pages: 72-77
Page count: 6