A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms

Cited by: 58
Authors
Weng, Jian [1]
Liu, Sihao [1]
Wang, Zhengrong [1]
Dadu, Vidushi [1]
Nowatzki, Tony [1]
Affiliation
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
Source
2020 IEEE International Symposium on High Performance Computer Architecture (HPCA 2020) | 2020
Keywords
Spatial Architecture; Reconfigurable Accelerator; Software/Hardware Codesign; Digital Signal Processor; PARALLELISM; IMPLEMENTATION; DECOMPOSITION; COMPUTATION; SYSTEM; 5G;
DOI
10.1109/HPCA47549.2020.00063
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812
Abstract
Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This causes high overhead both for multi-threading, due to fine-grain synchronization, and for vectorization, due to the non-rectangular iteration domains. CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first extend the traditional dataflow model with first-class primitives for inductive dependences and memory access patterns (streams). Second, we develop a hybrid spatial architecture combining systolic and dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6x to 37x. Compared to state-of-the-art spatial architectures, REVEL is 3.4x faster on average. Compared to a set of ASICs, REVEL uses only 2x the power and half the area.
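To make the "inductive" structure described in the abstract concrete, the following is a minimal sketch using Cholesky factorization as an illustrative dense linear algebra kernel. The choice of kernel and the names `cholesky_lower` and `trip_counts` are assumptions made for illustration; they are not taken from this record and do not represent REVEL's programming interface. The sketch shows that the inner-loop bounds depend on the outer induction variable `j` (a triangular, non-rectangular iteration domain) and that each column update consumes the diagonal value produced immediately before it (a fine-grain producer/consumer dependence).

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L (list of lists) such that L * L^T equals a.

    Assumes `a` is a symmetric positive-definite square matrix.
    """
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Producer: the diagonal entry. Its reduction length j grows
        # with the outer induction variable.
        s = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        low[j][j] = math.sqrt(s)
        # Consumers: the rows below the diagonal. The number of rows
        # (n - j - 1) shrinks as j grows, so the domain is triangular.
        for i in range(j + 1, n):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = s / low[j][j]  # uses the value just produced
    return low


def trip_counts(n):
    """Per outer iteration j: (rows updated, reduction length)."""
    return [(n - j - 1, j) for j in range(n)]


if __name__ == "__main__":
    matrix = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
    for row in cholesky_lower(matrix):
        print(row)
    print(trip_counts(4))
```

Because both trip counts change on every outer iteration, a fixed-width vector unit is left partly idle on the short iterations, and splitting the producer and consumer across threads would require synchronization once per column. These are the two overheads the abstract attributes to vectorization and multi-threading.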
Pages: 703-716
Page count: 14