Roofline-Model-Based Design Space Exploration for Dataflow Techniques of CNN Accelerators

Cited by: 16
Authors
Park, Chan [1 ]
Park, Sungkyung [1 ]
Park, Chester Sungchung [2 ]
Affiliations
[1] Pusan Natl Univ, Dept Elect Engn, Pusan 46241, South Korea
[2] Konkuk Univ, Dept Elect Engn, Seoul 05029, South Korea
Keywords
Computational modeling; Bandwidth; Space exploration; Hardware; Convolution; Memory management; Accelerator; convolutional neural networks (CNNs); dataflow techniques; roofline; simulation; processing element (PE); design space exploration (DSE); field-programmable gate array (FPGA);
DOI
10.1109/ACCESS.2020.3025550
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Discipline code
0812
Abstract
To compute convolutional layers effectively, a complex design space must be explored (e.g., the dataflow techniques associated with the layer parameters, loop transformation techniques, and hardware parameters). For efficient design space exploration (DSE) of various dataflow techniques, namely, the weight-stationary (WS), output-stationary (OS), row-stationary (RS), and no local reuse (NLR) techniques, the processing element (PE) structure and computational pattern of each dataflow technique are analyzed. Various performance metrics are calculated, namely, the throughput (in giga-operations per second, GOPS), computation-to-communication ratio (CCR), on-chip memory usage, and off-chip memory bandwidth, as closed-form expressions of the layer and hardware parameters. In addition, loop interchange and loop unrolling techniques with a double-buffer architecture are assumed. Many roofline-model-based simulations are performed to explore relevant dataflow techniques for a wide variety of convolutional layers of typical neural networks. Through simulation, this paper provides insights into the trends in accelerator performance as the layer parameters change. For convolutional layers with large input and output feature map (ifmap and ofmap) widths and heights, the GOPS of the NLR dataflow technique tends to be higher than that of the other techniques. For convolutional layers with small weight and ofmap widths and heights, the RS dataflow technique achieves optimal GOPS and on-chip memory usage. In the case of convolutional layers with small weight widths and heights, the GOPS of the WS dataflow technique tends to be high. In the case of convolutional layers with small ofmap widths and heights, the OS dataflow technique achieves optimal GOPS and on-chip memory usage.
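To make the roofline bound referred to in the abstract concrete, the following is a minimal sketch, not the paper's closed-form expressions: it estimates a convolutional layer's operation count, a naive off-chip-traffic bound, the resulting computation-to-communication ratio (CCR), and the attainable GOPS as min(peak compute, CCR x bandwidth). All parameter names (N, M, R, C, K, S, peak_gops, bandwidth_gbs) and the example numbers are hypothetical and do not follow the paper's notation.

    # Minimal roofline-model sketch for one convolutional layer (illustrative only).

    def conv_layer_ops(N, M, R, C, K):
        """Total operations (multiply + add) for a layer with N input channels,
        M output channels, an R x C ofmap, and K x K weights."""
        return 2 * N * M * R * C * K * K

    def conv_layer_bytes(N, M, R, C, K, S, bytes_per_word=2):
        """Naive off-chip-traffic bound: read the ifmap and weights once, write the
        ofmap once. Actual traffic depends on the dataflow (WS/OS/RS/NLR) and tiling."""
        ifmap = N * (R * S + K - S) * (C * S + K - S)  # ifmap size from ofmap size and stride
        weights = N * M * K * K
        ofmap = M * R * C
        return (ifmap + weights + ofmap) * bytes_per_word

    def attainable_gops(ops, traffic_bytes, peak_gops, bandwidth_gbs):
        """Roofline: performance is capped by peak compute or by CCR x bandwidth."""
        ccr = ops / traffic_bytes  # computation-to-communication ratio (ops/byte)
        return min(peak_gops, ccr * bandwidth_gbs), ccr

    # Hypothetical example: a VGG-like layer (64 -> 128 channels, 112 x 112 ofmap, 3 x 3 weights)
    ops = conv_layer_ops(N=64, M=128, R=112, C=112, K=3)
    traffic = conv_layer_bytes(N=64, M=128, R=112, C=112, K=3, S=1)
    gops, ccr = attainable_gops(ops, traffic, peak_gops=200.0, bandwidth_gbs=4.5)
    print(f"CCR = {ccr:.1f} ops/byte, attainable throughput = {gops:.1f} GOPS")

With these assumed numbers the layer is compute-bound (the CCR times the bandwidth exceeds peak compute), whereas a layer with small ofmap dimensions or heavy weight traffic would fall on the bandwidth-limited slope of the roofline, which is where the choice of dataflow technique matters most.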
Pages: 172509-172523
Page count: 15