Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs

Times cited: 0

Authors:

Pan, Zhewen [1]

San Miguel, Joshua [1]

Wu, Di [2]

Affiliations:

[1] Univ Wisconsin Madison, Madison, WI 53706 USA

[2] Univ Cent Florida, Orlando, FL 32816 USA

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ASPLOS 2024, VOL 2 | 2024

Keywords:

value-level parallelism; value reuse; temporal computing; low-precision; batch processing; multiplier-free

DOI:

10.1145/3620665.3640364

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been well studied to deliver better performance and efficiency for deep neural networks. With trends towards batched, low-precision data, e.g., FP8 format in this work, we observe that there is growing untapped potential for value reuse. We propose a novel computing paradigm, value-level parallelism, whereby unique products are computed only once, and different inputs subscribe to (select) their products via temporal coding. Our architecture, Carat, employs value-level parallelism and transforms multiplication into accumulation, performing GEMMs with efficient multiplier-free hardware. Experiments show that, on average, Carat improves iso-area throughput and energy efficiency by 1.02× and 1.06× over a systolic array and 3.2× and 4.3× when scaled up to multiple nodes.
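The abstract's central idea, computing each unique product once and letting inputs select it through a temporal code so that multiplication becomes accumulation, can be illustrated with a minimal software sketch. This is not Carat's microarchitecture. It assumes small unsigned integer inputs instead of FP8, uses one loop iteration to stand in for one hardware cycle, and the function names `temporal_products` and `multiplier_free_gemm` are hypothetical.

```python
from typing import Dict, List


def temporal_products(weight: int, inputs: List[int]) -> List[int]:
    """Multiply one shared weight by many inputs using only additions.

    Assumption (not from the paper): inputs are small unsigned integers and
    each loop iteration models one cycle. An accumulator adds `weight` once
    per cycle, so at cycle t it holds t * weight. Each input subscribes to
    the cycle equal to its own value and latches the accumulator then, so
    every unique product is formed once and shared by all equal inputs.
    """
    if any(x < 0 for x in inputs):
        raise ValueError("this sketch assumes unsigned integer inputs")
    results = [0] * len(inputs)
    # Group input positions by value: one subscription list per unique value.
    subscribers: Dict[int, List[int]] = {}
    for i, x in enumerate(inputs):
        subscribers.setdefault(x, []).append(i)
    acc = 0  # invariant: acc == t * weight at the start of cycle t
    for t in range(max(inputs, default=0) + 1):
        for i in subscribers.get(t, []):
            results[i] = acc  # input i selects the product for value t
        acc += weight  # accumulation replaces multiplication
    return results


def multiplier_free_gemm(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    """Compute C = A @ B, treating each a[i][k] as a weight shared by row b[k]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            # The batch dimension (row b[k]) reuses a single accumulator sweep.
            prods = temporal_products(a[i][k], b[k])
            for j in range(cols):
                c[i][j] += prods[j]
    return c


print(temporal_products(3, [4, 0, 4, 1]))
print(multiplier_free_gemm([[1, -2], [3, 4]], [[0, 5, 5], [2, 2, 7]]))
```

Running it prints `[12, 0, 12, 3]` and `[[-4, 1, -9], [8, 23, 43]]`. In this sketch the number of cycles grows with the largest input value rather than with the batch size, which hints at why batched, low-precision data is a good fit for this style of value reuse.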


Pages: 167-184

Page count: 18

References (77 in total):

[61] Shao YKS, 2019, MICRO'52: THE 52ND ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, P14, DOI 10.1145/3352460.3358302

[62] Shao YS, 2014, CONF PROC INT SYMP C, P97, DOI 10.1109/ISCA.2014.6853196

[63] Shen DG, 2017, ANNU REV BIOMED ENG, V19, P221, DOI 10.1146/annurev-bioeng-071516-044442

[64] Shen, Haichen; Chen, Lequn; Jin, Yuchen; Zhao, Liangyu; Kong, Bingyu; Philipose, Matthai; Krishnamurthy, Arvind; Sundaram, Ravi. Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis [J]. PROCEEDINGS OF THE TWENTY-SEVENTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES (SOSP '19), 2019, pp. 322-337

[65] Sheng Ying, 2023, INT C MACH LEARN

[66] Sun Xiao, 2019, Advances in Neural Information Processing Systems, V32

[67] Tzimpragos, Georgios; Volk, Jennifer; Wynn, Alex; Smith, James E.; Sherwood, Timothy. Superconducting Computing with Alternating Logic Elements [J]. 2021 ACM/IEEE 48TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2021), 2021, pp. 651-664

[68] Tzimpragos, Georgios; Madhavan, Advait; Vasudevan, Dilip; Strukov, Dmitri; Sherwood, Timothy. Boosted Race Trees for Low Energy Classification [J]. TWENTY-FOURTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXIV), 2019, pp. 215-228

[69] Wu, Di; Li, Jingjie; Yin, Ruokai; Hsiao, Hsuan; Kim, Younghyun; San Miguel, Joshua. uGEMM: Unary Computing Architecture for GEMM Applications [J]. 2020 ACM/IEEE 47TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2020), 2020, pp. 377-390

[70] Wu Di, 2021, INT C COMPUTER DESIG
