Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs

Citations: 0
Authors
Pan, Zhewen [1 ]
Miguel, Joshua San [1 ]
Wu, Di [2 ]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] Univ Cent Florida, Orlando, FL 32816 USA
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS 2024, VOL 2 | 2024
Keywords
value-level parallelism; value reuse; temporal computing; low-precision; batch processing; multiplier-free;
DOI
10.1145/3620665.3640364
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been well studied to deliver better performance and efficiency for deep neural networks. With trends towards batched, low-precision data, e.g., FP8 format in this work, we observe that there is growing untapped potential for value reuse. We propose a novel computing paradigm, value-level parallelism, whereby unique products are computed only once, and different inputs subscribe to (select) their products via temporal coding. Our architecture, Carat, employs value-level parallelism and transforms multiplication into accumulation, performing GEMMs with efficient multiplier-free hardware. Experiments show that, on average, Carat improves iso-area throughput and energy efficiency by 1.02× and 1.06× over a systolic array and 3.2× and 4.3× when scaled up to multiple nodes.
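The idea of computing each unique product once and letting inputs select it by time step can be illustrated with a small software sketch. This is not Carat's hardware design: it assumes small unsigned integer inputs instead of FP8, and the names `vlp_scale`, `vlp_batched_dot`, and `num_levels` are hypothetical. A running sum takes the values `weight * 0`, `weight * 1`, `weight * 2`, ... on successive steps, and each input in the batch captures the running sum at the step equal to its own value, so no multiplication is performed.

```python
def vlp_scale(weight, inputs, num_levels):
    """Return [weight * x for x in inputs] using only additions.

    Assumption: every x is an integer in range(num_levels).
    """
    products = [None] * len(inputs)
    running = 0  # equals weight * step, built by repeated addition
    for step in range(num_levels):
        # Each input "subscribes" to the product available at its own step.
        for i, x in enumerate(inputs):
            if x == step:
                products[i] = running
        running += weight  # accumulate instead of multiplying
    return products


def vlp_batched_dot(weights, batch, num_levels):
    """Dot product of `weights` with every row of `batch`, multiplier-free."""
    out = [0] * len(batch)
    for k, w in enumerate(weights):
        column = [row[k] for row in batch]  # k-th input of every batch row
        for b, p in enumerate(vlp_scale(w, column, num_levels)):
            out[b] += p
    return out


weights = [3, -2, 5]
batch = [[1, 2, 3], [0, 7, 7], [4, 4, 4]]  # 3-bit inputs, so num_levels = 8
print(vlp_batched_dot(weights, batch, num_levels=8))  # [14, 21, 24]
```

In this sketch the cost of generating the multiples of a weight is fixed at `num_levels` additions regardless of batch size, which is why larger batches and lower precision increase the opportunity for reuse.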
Pages: 167-184
Page count: 18