Fast Batched Matrix Multiplication for Small Sizes using Half-Precision Arithmetic on GPUs

Cited by: 27
Authors
Abdelfattah, Ahmad [1 ]
Tomov, Stanimire [1 ]
Dongarra, Jack [2 ,3 ,4 ]
Affiliations
[1] Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA
[2] Univ Tennessee, Knoxville, TN 37996 USA
[3] Oak Ridge Natl Lab, Oak Ridge, TN USA
[4] Univ Manchester, Manchester, Lancs, England
Source
2019 IEEE 33RD INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2019) | 2019
Keywords
Matrix multiplication; batched linear algebra; FP16 arithmetic; GPU computing; linear algebra
DOI
10.1109/IPDPS.2019.00022
Chinese Library Classification
TP3 [computing technology, computer technology]
Discipline Code
0812
Abstract
Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages in terms of GEMM. With the rise of batched linear algebra, batched GEMM operations have become increasingly popular in domains beyond dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has become the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses the low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that gives the developer extensive control despite those restrictions. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x on a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
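For context, the kind of low-level vendor API the abstract refers to is CUDA's WMMA (warp matrix multiply-accumulate) interface from mma.h, which exposes the Tensor Cores only through fixed fragment shapes such as 16x16x16 (the "discrete configurations" mentioned above). The following is a minimal, illustrative sketch, not the paper's kernel: the function name hgemm_batched_wmma, the one-warp-per-matrix mapping, and the assumption of 16x16 row-major A / column-major B operands are all illustrative choices.

// Minimal WMMA sketch (illustrative, not the paper's kernel): one warp
// computes one 16x16x16 FP16 product per batch entry. All names and the
// one-warp-per-problem mapping are assumptions; operands are assumed to be
// 16x16, with row-major A and column-major B.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void hgemm_batched_wmma(const half* const* A,
                                   const half* const* B,
                                   half* const* C)
{
    int b = blockIdx.x;  // one thread block (a single warp) per batch entry

    // Fragments for the fixed 16x16x16 Tensor Core tile shape.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));        // C := 0
    wmma::load_matrix_sync(a_frag, A[b], 16);                // leading dim = 16
    wmma::load_matrix_sync(b_frag, B[b], 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);          // C += A * B
    wmma::store_matrix_sync(C[b], c_frag, 16, wmma::mem_row_major);
}

// Launch: one 32-thread block (one warp) per matrix in the batch; requires
// compute capability 7.0+ (e.g., the Tesla V100 used in the paper):
//   hgemm_batched_wmma<<<batchCount, 32>>>(dA_array, dB_array, dC_array);

The fixed fragment shape is exactly the hardware restriction the paper works around: matrices smaller than the tile leave Tensor Core lanes idle unless the kernel maps multiple problems onto one tile, which is the regime where the reported speedups for very small matrices arise.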
Pages: 111-122 (12 pages)