A Low-Cost Floating-Point Dot-Product-Dual-Accumulate Architecture for HPC-Enabled AI

Cited by: 2
Authors
Tan, Hongbing [1 ]
Huang, Libo [1 ]
Zheng, Zhong [1 ]
Guo, Hui [1 ]
Yang, Qianmin [1 ]
Shen, Li [1 ]
Chen, Gang [2 ]
Xiao, Liquan [1 ]
Xiao, Nong
Affiliations
[1] Natl Univ Def Technol, Coll Comp Sci & Technol, Changsha 410073, Peoples R China
[2] Sun Yat-sen Univ, Sch Data & Comp Sci, Guangzhou 510006, Peoples R China
Keywords
Dot-product-dual-accumulate (DPDAC); fused multiply-add; high-performance computing (HPC)-enabled artificial intelligence (AI); mixed-precision; numerical precision conversion; transprecision computing; FUSED-MULTIPLY-ADD; PERFORMANCE;
DOI
10.1109/TCAD.2023.3316994
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology]
Discipline Code
0812
Abstract
The dot-product $\sum_{i=1}^{N} A_i \times B_i$ is one of the most frequently used operations in a wide variety of high-performance computing (HPC) and artificial intelligence (AI) applications. However, in large-scale algorithms such as general matrix multiplication (GEMM) and the fast Fourier transform (FFT), independent additions are needed to accumulate the results of length-limited dot-products into the final result, increasing latency and overhead. Hence, we propose a dot-product-dual-accumulate (DPDAC) architecture capable of performing $\sum_{i=1}^{N} A_i \times B_i + \sum_{j=1}^{M} C_j$ (with $N = 1, 2, 4$ and $M = 1, 2$) on a wide range of formats. The proposed architecture supports both single-path and dual-path execution. The single path performs dot-product (DP) fused multiply-add (FMA) or DPDAC on the lower-precision formats, while the dual path supports a single-precision (SP) addition in parallel with a 2-term SP or TF32 dot-product, or a 4-term half-precision (HP) or BF16 dot-product. Moreover, the proposed architecture also supports numerical precision conversion, allowing numbers to be converted to higher or lower formats. The proposed DPDAC has been demonstrated to significantly reduce overhead compared to discrete designs that use multiple single-mode floating-point (FP) units to achieve the same functionality. Furthermore, compared with state-of-the-art multiple-precision designs, the proposed architecture supports a wider range of formats and a greater variety of operations at lower cost.
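To make the DPDAC expression concrete, the following minimal Python sketch models the 4-term BF16 case ($N = 4$, $M = 2$) in software. The function names (to_bf16, dpdac) and the truncation-based BF16 conversion are illustrative assumptions, not the paper's design: the actual architecture fuses the whole expression in hardware with format-specific rounding, whereas this model rounds after every step.

import struct

def to_bf16(x: float) -> float:
    # Truncate a float32 value to BF16 precision by zeroing the low
    # 16 bits of its encoding (real converters round to nearest even;
    # truncation keeps this sketch short).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def dpdac(a, b, c):
    # Functional model of sum_{i=1}^{N} A_i * B_i + sum_{j=1}^{M} C_j
    # for N = 4 BF16 products and M = 2 accumulands. The hardware
    # fuses the whole expression with a single rounding; this model
    # rounds per step, so results may differ in the last bits.
    assert len(a) == len(b) == 4 and len(c) == 2
    dot = sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))
    return dot + c[0] + c[1]

# Example: 4-term BF16 dot-product plus two accumulands.
print(dpdac([1.5, 2.25, -0.75, 3.0], [0.5, 1.0, 2.0, -1.0], [10.0, 0.125]))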
Pages: 681-693
Page count: 13