Cascade: An Application Pipelining Toolkit for Coarse-Grained Reconfigurable Arrays

Authors
Melchert, Jackson [1]
Mei, Yuchen [1]
Koul, Kalhan [1]
Liu, Qiaoyi [1]
Horowitz, Mark [1]
Raina, Priyanka [1]
Affiliations
[1] Stanford Univ, Dept Elect Engn, Stanford, CA 94305 USA
Keywords
Pipeline processing; Registers; Delays; Routing; Integrated circuit interconnections; Field programmable gate arrays; Switches; Accelerator compilers; application pipelining; coarse-grained reconfigurable arrays (CGRAs); hardware accelerators
DOI
10.1109/TCAD.2024.3390542
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
While coarse-grained reconfigurable arrays (CGRAs) have emerged as promising programmable accelerator architectures, they require automatic pipelining of applications during their compilation flow to achieve high performance. Current CGRA compilers either lack pipelining altogether, resulting in low application performance, or perform exhaustive pipelining, resulting in high power and resource consumption. We address these challenges by proposing Cascade, an end-to-end open-source application compiler for CGRAs that achieves both state-of-the-art performance and fast compilation times. The contributions of this work are: 1) a novel post place-and-route (PnR) application pipelining technique for CGRAs that accounts for interconnect hop delays during pipelining, but in a way that avoids cyclic scheduling and PnR; 2) a register resource usage optimization technique that leverages the scheduling logic in CGRA memory tiles to minimize the number of register resources used during pipelining; and 3) an automated CGRA timing model generator, an application timing analysis tool, and a large set of existing and novel application pipelining techniques integrated into an end-to-end compilation flow. Compared to a compiler without pipelining, Cascade achieves 8-34x lower critical path delay and 7-190x lower energy-delay product (EDP) across a variety of dense image processing and machine learning workloads, and 3-5.2x lower critical path delay and 2.5-5.2x lower EDP on sparse workloads. Cascade mitigates the performance and energy-efficiency drawbacks of existing CGRA compilers and enables further research into CGRAs as flexible yet competitive accelerator architectures.
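To make the idea of accounting for interconnect hop delays concrete, the following is a minimal sketch of a generic longest-path timing analysis on a placed-and-routed dataflow graph. It is not Cascade's algorithm or API. The function name `critical_path_ps`, the constant `HOP_DELAY_PS`, the tile delays, and the toy graph are all hypothetical values chosen for illustration, and the model assumes a pipeline register sits at the sink end of a routed edge.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

HOP_DELAY_PS = 200  # hypothetical delay of one interconnect hop, in picoseconds


def critical_path_ps(tile_delay, edges, registered=frozenset()):
    """Return the longest register-to-register delay in a routed DAG.

    tile_delay: dict mapping node name -> combinational delay of its tile (ps)
    edges:      list of (src, dst, hops) tuples describing routed connections
    registered: set of (src, dst) edges that carry a pipeline register
                (assumed to sit at the sink end of the route)
    """
    preds = {n: [] for n in tile_delay}
    for src, dst, hops in edges:
        preds[dst].append((src, hops))

    # Visit nodes so that every predecessor is processed before its successors.
    order = TopologicalSorter(
        {n: [s for s, _ in p] for n, p in preds.items()}
    ).static_order()

    arrival = {}
    worst = 0
    for n in order:
        start = 0
        for src, hops in preds[n]:
            at_sink = arrival[src] + hops * HOP_DELAY_PS
            if (src, n) in registered:
                # The path ends at the register; the next path starts at 0.
                worst = max(worst, at_sink)
            else:
                start = max(start, at_sink)
        arrival[n] = start + tile_delay[n]
        worst = max(worst, arrival[n])
    return worst


# Hypothetical four-node application: in -> mul -> add -> out
tile_delay = {"in": 0, "mul": 1000, "add": 500, "out": 0}
edges = [("in", "mul", 1), ("mul", "add", 3), ("add", "out", 1)]

unpipelined = critical_path_ps(tile_delay, edges)                  # 2500 ps
pipelined = critical_path_ps(tile_delay, edges, {("mul", "add")})  # 1800 ps
```

In this toy model, registering the three-hop route between `mul` and `add` shortens the critical path, but the route's own 600 ps of hop delay still counts toward the path ending at the register. Ignoring that delay would understate the critical path, which is why a timing model of the interconnect matters when deciding where to place registers.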
Pages: 3055-3067
Page count: 13