dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators

Cited by: 68
Authors
Dave, Shail [1]
Kim, Youngbin [2]
Avancha, Sasikanth [3]
Lee, Kyoungwoo [2]
Shrivastava, Aviral [1]
Affiliations
[1] Arizona State Univ, Compiler Microarchitecture Lab, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85281 USA
[2] Yonsei Univ, Seoul, South Korea
[3] Intel Labs, Parallel Comp Lab, Bangalore, Karnataka, India
Funding
National Research Foundation of Singapore; US National Science Foundation
Keywords
Coarse-grained reconfigurable array; dataflow; deep neural networks; loop optimization; energy-efficiency; systolic arrays; mapping; analytical model; design space exploration
DOI
10.1145/3358198
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Dataflow accelerators feature simplicity, programmability, and energy efficiency, and are envisioned as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, discovering the most efficient way to execute the perfectly nested loop of an application on the computational and memory resources of a given dataflow accelerator (an execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner, a framework that efficiently and accurately explores the vast space of ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators. The novelty of the dMazeRunner framework lies in: i) a holistic representation of the loop nests that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16x better in Energy-Delay Product (EDP) and 5.83x better in execution time than prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP compared to the optimal solution.
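For readers unfamiliar with the terms in the abstract, the following Python sketch illustrates three of them: a convolution layer written as a single perfectly nested loop, the Energy-Delay Product metric, and how quickly the number of candidate loop tilings grows. It is a minimal illustration under stated assumptions, not the dMazeRunner implementation. The function names (`conv2d_loop_nest`, `energy_delay_product`, `tilings`), the tensor layouts, and the idea of splitting each loop bound into one factor per tiling level are all hypothetical choices made here for illustration.

```python
def conv2d_loop_nest(ifmap, weights, stride=1):
    """Direct convolution as one perfectly nested loop (7 loops, one statement).

    Assumed layouts: ifmap[N][C][H][W], weights[M][C][Fy][Fx],
    result ofmap[N][M][Oy][Ox]. No padding is applied.
    """
    N, C = len(ifmap), len(ifmap[0])
    H, W = len(ifmap[0][0]), len(ifmap[0][0][0])
    M = len(weights)
    Fy, Fx = len(weights[0][0]), len(weights[0][0][0])
    Oy = (H - Fy) // stride + 1
    Ox = (W - Fx) // stride + 1
    ofmap = [[[[0] * Ox for _ in range(Oy)] for _ in range(M)] for _ in range(N)]
    # All work sits in the innermost body, which is what makes the nest "perfect".
    for n in range(N):
        for m in range(M):
            for c in range(C):
                for oy in range(Oy):
                    for ox in range(Ox):
                        for fy in range(Fy):
                            for fx in range(Fx):
                                ofmap[n][m][oy][ox] += (
                                    ifmap[n][c][oy * stride + fy][ox * stride + fx]
                                    * weights[m][c][fy][fx]
                                )
    return ofmap


def energy_delay_product(energy, delay):
    """EDP is the product of energy consumed and execution time (lower is better)."""
    return energy * delay


def tilings(bound, levels):
    """Enumerate ordered splits of one loop bound into `levels` integer factors.

    Hypothetical illustration: each factor could be the trip count of that loop
    at one tiling level; the product of the factors must equal the loop bound.
    """
    if levels == 1:
        return [(bound,)]
    result = []
    for factor in range(1, bound + 1):
        if bound % factor == 0:
            for rest in tilings(bound // factor, levels - 1):
                result.append((factor,) + rest)
    return result


if __name__ == "__main__":
    ones_ifmap = [[[[1] * 3 for _ in range(3)]]]    # N=1, C=1, H=W=3
    ones_weights = [[[[1] * 2 for _ in range(2)]]]  # M=1, C=1, Fy=Fx=2
    print(conv2d_loop_nest(ones_ifmap, ones_weights))
    print(len(tilings(8, 3)))
```

Because the choices for each of the seven loops multiply together, and loop orderings add further options, even modest layer shapes yield a very large number of candidate execution methods, which is why pruning matters for search time.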
Pages: 27