A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Cited by: 0
Authors
Ledoux, Louis [1]
Casas, Marc [1]
Affiliation
[1] Universitat Politècnica de Catalunya (UPC), Barcelona Supercomputing Center (BSC), Barcelona, Spain
Source
2022 IEEE 30th International Symposium on Field-Programmable Custom Computing Machines (FCCM 2022), 2022
Keywords
Stability; Design
DOI
10.1109/FCCM53951.2022.9786164
CLC Classification Number
TP3 [Computing technology and computer technology]
Discipline Classification Code
0812
Abstract
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy-efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering higher accuracy and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex UltraScale+ FPGA. The arrays can operate at the speed of the link and saturate it, reaching a throughput of 13 GB/s. Our fine-grain customization approach covers a wide range of accuracy-versus-efficiency scenarios and can reach 0.65 GOps/s/W while producing 1024 accurate bits, or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s of system performance and FPGA power efficiencies of up to 240 GOps/s/W. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86x.
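The no-intermediate-rounding dot products in the processing elements are the abstract's core numerical idea. As a rough illustration only (this is not the authors' generator, HDL, or API; the helper names to_s3 and exact_dot are hypothetical, and the wide Kulisch-style fixed-point accumulator is an assumed software analogue of the S3-based datapath), a minimal Python sketch of a dot product with a single final rounding looks like this:

from fractions import Fraction
import math

def to_s3(x: float):
    """Decompose a float into (sign, scale, significand), an illustrative
    stand-in for the paper's S3 intermediate representation."""
    if x == 0.0:
        return 0, 0, 0
    sign = 1 if x < 0 else 0
    mantissa, exponent = math.frexp(abs(x))   # abs(x) == mantissa * 2**exponent, 0.5 <= mantissa < 1
    significand = int(mantissa * (1 << 53))   # exact integer significand for a binary64 input
    scale = exponent - 53                     # abs(x) == significand * 2**scale
    return sign, scale, significand

def exact_dot(a, b, acc_frac_bits=4096):
    """Accumulate every product of a and b in one wide fixed-point register
    (Kulisch-style), so rounding happens only once, at the very end."""
    acc = 0  # fixed-point accumulator; its value is acc * 2**(-acc_frac_bits)
    for x, y in zip(a, b):
        sx, ex, mx = to_s3(x)
        sy, ey, my = to_s3(y)
        prod = mx * my                        # exact integer product of significands
        shift = ex + ey + acc_frac_bits       # align the product with the accumulator scale
        # The default width covers any product of two binary64 values exactly;
        # the right shift is only a fallback if acc_frac_bits is set too small.
        term = (prod << shift) if shift >= 0 else (prod >> -shift)
        acc += -term if (sx ^ sy) else term
    return float(Fraction(acc, 1 << acc_frac_bits))  # single, final rounding

# A dot product that naive float accumulation gets wrong due to cancellation:
a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
print(exact_dot(a, b))                    # 1.0 (exact)
print(sum(x * y for x, y in zip(a, b)))   # 0.0 with rounded partial sums

Because every partial product is held exactly until the one final rounding, the result does not depend on summation order, which is the property the abstract refers to as reproducibility.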
Pages: 200-209
Page count: 10