A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs

Cited by: 0

Authors:

Ledoux, Louis [1]

Casas, Marc [1]

Affiliation:

[1] Univ Politecn Catalunya UPC, Barcelona Supercomp Ctr BSC, Barcelona, Spain

Source:

2022 IEEE 30TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM 2022) | 2022

Keywords:

STABILITY; DESIGN;

DOI:

10.1109/FCCM53951.2022.9786164

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

We propose a hardware generator of GEMM accelerators. Our generator produces vendor-agnostic HDL describing highly customizable systolic arrays guided by accuracy and energy efficiency goals. The generated arrays have three main novel aspects. First, the accelerators handle a large variety of computer number formats using intermediate representations based on our Sign Scale Significand (S3) format. Second, the processing elements perform all intermediate dot-product arithmetic operations required by the GEMM kernel without any intermediate rounding, which makes it possible to deliver better energy efficiency than state-of-the-art approaches while offering more accurate and reproducible results. Third, our accelerators feature the Half-Speed Sink Down (HSSD) mechanism, which maximizes the overlap of host-accelerator data transfers with GEMM computations. We evaluate our automatically generated designs in a cutting-edge setup composed of a POWER9 host, a CAPI (Coherent Accelerator Processor Interface) link, and a Virtex UltraScale+ FPGA. The arrays can operate at the speed of the link and saturate it, reaching a throughput of 13 GB/s. Our fine-grained customization approach covers a wide range of accuracy-versus-efficiency scenarios, reaching 0.65 GOps/s/W while producing 1024 accurate bits or 148.7 GOps/s/W with 6 accurate bits. Our configurations achieve up to 1613 GOps/s system performance and power efficiencies of up to 240 GOps/s/W for the FPGA. This automatic generator is the first able to produce such a variety of designs. We improve the single-precision energy efficiency of state-of-the-art FPGA GEMM accelerators by 1.86x.


Pages: 200-209

Page count: 10

