Hardware Accelerator Integration Tradeoffs for High-Performance Computing: A Case Study of GEMM Acceleration in N-Body Methods

Cited by: 4
Authors
Asri, Mochamad [1 ]
Malhotra, Dhairya [2 ]
Wang, Jiajun [1 ]
Biros, George [3 ]
John, Lizy K. [1 ]
Gerstlauer, Andreas [1 ]
Affiliations
[1] Univ Texas Austin, Elect & Comp Engn Dept, Austin, TX 78712 USA
[2] Flatiron Inst, New York, NY 10010 USA
[3] Univ Texas Austin, Inst Computat Engn & Sci, Austin, TX 78712 USA
Keywords
System-on-chip; Acceleration; Random access memory; Optimization; Couplings; Computer architecture; Software
DOI
10.1109/TPDS.2021.3056045
Chinese Library Classification
TP301 [Theory and Methods]
Discipline Code
081202
Abstract
In this article, we study the performance and energy-saving benefits of hardware acceleration under different hardware configurations and usage scenarios for a state-of-the-art Fast Multipole Method (FMM), a popular N-body method. We use a dedicated Application-Specific Integrated Circuit (ASIC) to accelerate General Matrix-Matrix Multiply (GEMM) operations. FMM is widely used and is a representative workload for many HPC applications. We compare architectures that integrate the GEMM ASIC next to, in, or near main memory against an on-chip coupling aimed at minimizing or avoiding repeated round-trip transfers through DRAM for communication between the accelerator and the CPU. We study tradeoffs using detailed and accurately calibrated x86 CPU, accelerator, and DRAM simulations. Our results show that simply moving accelerators closer to the chip does not necessarily lead to performance or energy gains. We demonstrate that, while careful software blocking and on-chip placement optimizations can reduce DRAM accesses by 2X over a naive on-chip integration, these dramatic savings in DRAM traffic do not automatically translate into significant total energy or runtime savings. This is chiefly due to the application characteristics, high idle power, and the effective hiding of memory latencies in modern systems. Only when more aggressive co-optimizations such as software pipelining and overlapping are applied can additional performance and energy savings of 37 and 35 percent, respectively, be unlocked over baseline acceleration. When similar optimizations (pipelining and overlapping) are applied to an off-chip integration, on-chip integration delivers up to 20 percent better performance and 17 percent less total energy consumption than off-chip integration.
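The software blocking the abstract credits with a 2X reduction in DRAM accesses amounts to computing GEMM tile by tile, so that each tile's working set fits in on-chip storage instead of streaming whole operands through DRAM. The following is a minimal illustrative NumPy sketch of such a tiled GEMM, not the authors' implementation; the function name and block size are assumptions for illustration.

```python
import numpy as np

def blocked_gemm(A, B, block=64):
    """Compute C = A @ B one (block x block) tile at a time.

    Blocking keeps each working set (one tile each of A, B, and C) small
    enough to stay in on-chip storage, cutting round trips to DRAM -- the
    effect the abstract attributes to careful software blocking.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    # NumPy slices past the array edge are clamped, so ragged tiles at the
    # borders are handled automatically.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                C[i0:i0 + block, j0:j0 + block] += (
                    A[i0:i0 + block, k0:k0 + block]
                    @ B[k0:k0 + block, j0:j0 + block]
                )
    return C
```

The result is bitwise-equivalent work to an untiled GEMM; only the order of memory accesses changes, which is why such blocking can reduce DRAM traffic without altering the numerical output (up to floating-point summation order).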
Pages: 2035-2048 (14 pages)