A Coprocessor for Double-Precision Floating-Point Matrix Multiplication

Cited by: 0
Authors
Jia X. [1 ]
Wu G. [1 ]
Xie X. [1 ]
Wu D. [1 ]
Affiliations
[1] State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, 214125, Jiangsu
Source
Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2019, Vol. 56, No. 02
Funding
National Natural Science Foundation of China
Keywords
Acceleration; Coprocessor; Floating-point; Hardware customization; Matrix multiplication;
DOI
10.7544/issn1000-1239.2019.20170908
Abstract
Matrix multiplication is widely used across application domains, especially in numerical computation. However, double-precision floating-point matrix multiplication achieves suboptimal performance or efficiency on contemporary computing platforms, including CPUs, GPGPUs, and FPGAs. To address this problem, this paper proposes accelerating double-precision floating-point matrix multiplication with a customized coprocessor that adopts a linear array as its basic building block. First, a double-buffering technique and optimized memory scheduling are applied to the basic linear array to improve computation efficiency. Then, the architecture of the matrix multiplication coprocessor and of a coprocessor-based accelerated computing system are formulated. Furthermore, a performance model tailored to the coprocessor is developed, and the coprocessor's design space is explored in detail. Finally, the functional correctness of the coprocessor is verified, and its hardware implementation cost at a mainstream technology node is evaluated. Experimental results show that the proposed coprocessor achieves a performance of 3 TFLOPS with 99% efficiency. For double-precision floating-point matrix multiplication, it delivers 1.95× the performance of an NVIDIA K40 GPGPU at only 21.05% of the area. This work explores customized acceleration in high-performance computing and offers guidance for improving the performance of existing computing systems. © 2019, Science Press. All rights reserved.
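
The abstract names a linear array with double buffering as the coprocessor's basic building block but, as a summary, gives no implementation detail. The Python sketch below is a rough software analogue of one plausible such scheme, not the authors' actual design: each processing element (PE) in the array holds one column of the current B tile, rows of A stream through the array, and the next tile is prefetched while the current one is consumed. All names here (NUM_PE, linear_array_matmul, the column-per-PE mapping) are illustrative assumptions.

import numpy as np

NUM_PE = 4  # width of the linear array (assumed, for illustration only)

def linear_array_matmul(A, B):
    """Compute C = A @ B by streaming A through a linear array of PEs.

    Per tile, PE j is assigned one column of the current B tile; each row
    of A flows PE-to-PE so every PE sees the full row and accumulates one
    output column. Double buffering is modeled by prefetching the next
    tile's columns while the current tile is being consumed.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))

    # Partition the columns of B into tiles of NUM_PE columns each.
    tiles = [list(range(j, min(j + NUM_PE, m))) for j in range(0, m, NUM_PE)]
    buf = [B[:, tiles[0]]]  # preload the first tile into the front buffer
    for t_idx, tile in enumerate(tiles):
        cur = buf.pop(0)
        # Prefetch the next tile; on real hardware this load overlaps
        # with the computation below (the ping-pong of a double buffer).
        if t_idx + 1 < len(tiles):
            buf.append(B[:, tiles[t_idx + 1]])
        # Stream each row of A through the array; PE j multiplies the row
        # against its resident column and accumulates into C.
        for i in range(n):
            for j, col in enumerate(tile):
                C[i, col] += A[i, :] @ cur[:, j]
    return C

# Quick functional check against NumPy's reference result.
A = np.random.rand(6, 5)
B = np.random.rand(5, 7)
assert np.allclose(linear_array_matmul(A, B), A @ B)

The assert at the end checks only functional equivalence with a reference product; it says nothing about the performance or efficiency figures reported in the paper.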
Pages: 410-420
Page count: 10