Spatial Data Dependence Graph Based Pre-RTL Simulator for Convolutional Neural Network Dataflows

Cited by: 4
Authors
Wang, Jooho [1 ]
Park, Sungkyung [2 ]
Park, Chester Sungchung [1 ]
Affiliations
[1] Konkuk Univ, Dept Elect & Elect Engn, Seoul 05029, South Korea
[2] Pusan Natl Univ, Dept Elect Engn, Pusan 46241, South Korea
Keywords
Hardware acceleration; Memory management; Convolutional neural networks; Bandwidth; Spatial databases; Registers; Power demand; Convolutional neural networks (CNNs); data dependence graph; design space exploration (DSE); hardware accelerators; latency-insensitive controller; pre-RTL simulator; spatial data dependence graph (SDDG); ARCHITECTURE; PERFORMANCE; INFERENCE; COST; DRAM;
DOI
10.1109/ACCESS.2022.3146413
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
In this paper, a new pre-RTL simulator is proposed to predict the power, performance, and area of convolutional neural network (CNN) dataflows prior to register-transfer-level (RTL) design. In the simulator, a novel approach is adopted to implement a spatial data dependence graph (SDDG), which enables us to model a specific dataflow along with its inter-instruction dependencies by tracking the status of each processing element (PE). In addition, the proposed pre-RTL simulator makes it possible to evaluate the impact of memory constraints such as latency and bandwidth. The latency-insensitive and bandwidth-insensitive PE controllers assumed in the proposed pre-RTL simulator guarantee both functional correctness and maximum performance, regardless of memory constraints. In particular, it is shown that optimally distributing the local memory bandwidth can reduce the accelerator execution time by up to 37.6% compared with an equal distribution. For both weight-stationary (WS) and row-stationary (RS) dataflows, the accelerator performance depends closely on the memory constraints. The simulation results also show that the relative performance of the dataflows depends on the shape of the convolutional layer. For example, for an identical hardware area in a standard convolutional layer of AlexNet, WS dataflows do not provide any performance gain over RS dataflows when the memory latency is sufficiently high. In addition, since the number of weights loaded at any given time is limited, WS dataflows cannot fully reuse the input activations, thereby increasing local memory accesses. Moreover, in a depth-wise convolutional layer of MobileNet, WS dataflows tend to outperform RS dataflows even in the presence of large memory latency. The source code is available on the GitHub repository: https://github.com/SDL-KU/SDDGSim.
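To make the abstract's idea concrete, the sketch below is a deliberately simplified, hypothetical illustration (not the authors' SDDGSim code) of how a pre-RTL estimator can track per-PE operation status and dependences while honoring a local-memory latency and bandwidth constraint. All class names, parameters (mem_latency, mem_bandwidth), and the single-port memory model are assumptions introduced only for illustration.

```python
# Hypothetical, minimal sketch of an SDDG-style pre-RTL cycle estimate (not the authors' code).
# Each PE runs a linear chain of operations; LOADs contend for a shared local-memory port with
# a fixed latency (cycles) and bandwidth (loads issued per cycle), and each MAC waits for the
# loads it depends on. The simulator counts cycles until every PE has drained its operations.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    kind: str                                       # "LOAD" or "MAC"
    deps: List[int] = field(default_factory=list)   # indices of earlier ops in the same PE
    done_cycle: int = -1                            # cycle at which the result is available

@dataclass
class PE:
    ops: List[Op]
    pc: int = 0                                     # next op to issue

def simulate(pes: List[PE], mem_latency: int, mem_bandwidth: int,
             max_cycles: int = 100_000) -> int:
    """Return the cycle at which every PE has completed all of its operations."""
    cycle = 0
    while any(pe.pc < len(pe.ops) for pe in pes) and cycle < max_cycles:
        loads_issued = 0
        for pe in pes:
            if pe.pc >= len(pe.ops):
                continue
            op = pe.ops[pe.pc]
            # An op may issue only when all of its dependences have completed.
            ready = all(0 <= pe.ops[d].done_cycle <= cycle for d in op.deps)
            if not ready:
                continue
            if op.kind == "LOAD":
                if loads_issued < mem_bandwidth:          # bandwidth constraint
                    op.done_cycle = cycle + mem_latency   # latency constraint
                    loads_issued += 1
                    pe.pc += 1
            else:  # a MAC takes one cycle once its operands are available
                op.done_cycle = cycle + 1
                pe.pc += 1
        cycle += 1
    return cycle

if __name__ == "__main__":
    # Two PEs, each loading a weight and an input before one MAC (a tiny WS-like chain).
    def chain() -> PE:
        return PE(ops=[Op("LOAD"), Op("LOAD"), Op("MAC", deps=[0, 1])])
    print("cycles:", simulate([chain(), chain()], mem_latency=4, mem_bandwidth=1))
```

Sweeping mem_latency and mem_bandwidth in such a toy model mirrors, at a very coarse level, the kind of memory-constraint exploration the paper performs; the actual simulator additionally models power and area and the full SDDG construction described in the paper.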
Pages: 11382-11403
Page count: 22