A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Cited by: 0
Authors
Ledoux, Louis [1]
Casas, Marc [1]
Affiliations
[1] Univ Politecn Catalunya UPC, Barcelona Supercomp Ctr BSC, Barcelona, Spain
Source
2022 IEEE 30TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM 2022) | 2022
Keywords
STABILITY; DESIGN;
DOI
10.1109/FCCM53951.2022.9786164
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accurate and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex Ultrascale Plus FPGA. Arrays can operate at the speed of the link and saturate it to reach a throughput of 13 GB/s. Our fine-grained customization approach covers a wide range of accuracy versus efficiency scenarios and can reach 0.65 GOps/s/W while producing 1024 accurate bits or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s system performance and power efficiencies of up to 240 GOps/s/W for the FPGA. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86x.
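The second point of the abstract, accumulating a dot product with no intermediate rounding, can be illustrated with a small software sketch. This is not the authors' S3 format or hardware design: it is an assumption-laden stand-in that uses Python's exact rationals as the wide accumulator and binary32 as the input and output format. The helper names `to_f32`, `dot_rounded`, and `dot_exact` are hypothetical.

```python
from fractions import Fraction
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_rounded(a, b):
    """Dot product that rounds to binary32 after every multiply and add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def dot_exact(a, b):
    """Dot product accumulated exactly (rational arithmetic stands in for a
    wide hardware accumulator), rounded to binary32 once at the end."""
    acc = Fraction(0)
    for x, y in zip(a, b):
        acc += Fraction(x) * Fraction(y)
    return to_f32(float(acc))


ones = [1.0, 1.0, 1.0]
order_1 = [1e8, 1.0, -1e8]   # the 1.0 is absorbed by 1e8 in binary32
order_2 = [1e8, -1e8, 1.0]   # same terms, different summation order

print(dot_rounded(order_1, ones), dot_rounded(order_2, ones))  # 0.0 1.0
print(dot_exact(order_1, ones), dot_exact(order_2, ones))      # 1.0 1.0
```

With per-step rounding the result depends on the order of the terms, whereas the exactly accumulated version returns the same, correctly rounded value for both orders, which is the accuracy and reproducibility property the abstract refers to.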
Pages: 200-209
Page count: 10