A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Cited by: 0
Authors
Ledoux, Louis [1]
Casas, Marc [1]
Affiliation
[1] Universitat Politècnica de Catalunya (UPC), Barcelona Supercomputing Center (BSC), Barcelona, Spain
Source
2022 IEEE 30th International Symposium on Field-Programmable Custom Computing Machines (FCCM 2022), 2022
Keywords
Stability; Design
DOI
10.1109/FCCM53951.2022.9786164
CLC Classification Number
TP3 [Computing technology and computer technology]
Discipline Classification Code
0812
Abstract
We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy-efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering higher accuracy and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex UltraScale+ FPGA. The arrays can operate at the speed of the link and saturate it, reaching a throughput of 13 GB/s. Our fine-grain customization approach covers a wide range of accuracy-versus-efficiency scenarios and can reach 0.65 GOps/s/W while producing 1024 accurate bits, or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s of system performance and FPGA power efficiencies of up to 240 GOps/s/W. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86x.
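The no-intermediate-rounding dot products in the processing elements are the abstract's core numerical idea. As a rough illustration only (this is not the authors' generator, HDL, or API; the helper names to_s3 and exact_dot are hypothetical, and the wide Kulisch-style fixed-point accumulator is an assumed software analogue of the S3-based datapath), a minimal Python sketch of a dot product with a single final rounding looks like this:

from fractions import Fraction
import math

def to_s3(x: float):
    """Decompose a float into (sign, scale, significand), an illustrative
    stand-in for the paper's S3 intermediate representation."""
    if x == 0.0:
        return 0, 0, 0
    sign = 1 if x < 0 else 0
    mantissa, exponent = math.frexp(abs(x))   # abs(x) == mantissa * 2**exponent, 0.5 <= mantissa < 1
    significand = int(mantissa * (1 << 53))   # exact integer significand for a binary64 input
    scale = exponent - 53                     # abs(x) == significand * 2**scale
    return sign, scale, significand

def exact_dot(a, b, acc_frac_bits=4096):
    """Accumulate every product of a and b in one wide fixed-point register
    (Kulisch-style), so rounding happens only once, at the very end."""
    acc = 0  # fixed-point accumulator; its value is acc * 2**(-acc_frac_bits)
    for x, y in zip(a, b):
        sx, ex, mx = to_s3(x)
        sy, ey, my = to_s3(y)
        prod = mx * my                        # exact integer product of significands
        shift = ex + ey + acc_frac_bits       # align the product with the accumulator scale
        # The default width covers any product of two binary64 values exactly;
        # the right shift is only a fallback if acc_frac_bits is set too small.
        term = (prod << shift) if shift >= 0 else (prod >> -shift)
        acc += -term if (sx ^ sy) else term
    return float(Fraction(acc, 1 << acc_frac_bits))  # single, final rounding

# A dot product that naive float accumulation gets wrong due to cancellation:
a = [1e16, 1.0, -1e16]
b = [1.0, 1.0, 1.0]
print(exact_dot(a, b))                    # 1.0 (exact)
print(sum(x * y for x, y in zip(a, b)))   # 0.0 with rounded partial sums

Because every partial product is held exactly until the one final rounding, the result does not depend on summation order, which is the property the abstract refers to as reproducibility.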
Pages: 200-209
Page count: 10