DNNOPT: A Framework for Efficiently Selecting On-chip Memory Loop Optimizations of DNN Accelerators

Cited by: 0
Authors
Ranawaka, Piyumal [1 ]
Azhar, Muhammad Waqar [1 ]
Stenstrom, Per [1 ]
Affiliations
[1] Chalmers Univ Technol, Gothenburg, Sweden
Funding
Swedish Research Council;
Keywords
DNN acceleration; Loop Re-Order; Loop Blocking; Reuse Distance; Energy Efficient DNN Acceleration; On-chip Memory Management; DESIGN SPACE EXPLORATION; PERFORMANCE; COMPILER; REUSE;
DOI
10.1145/3649153.3649196
CLC Classification
TP39 [Computer Applications];
Discipline Codes
081203; 0835;
Abstract
Deep neural network (DNN) accelerators suffer from poor utilization of on-chip memory, which reduces performance and energy efficiency. Loop reordering and blocking are used to improve on-chip memory utilization in DNN accelerators. However, existing optimization frameworks are inefficient, either because of the prohibitive time complexity of searching the entire search space or because of a sub-optimal choice of optimizations. This paper proposes DNNOPT, a hardware/software framework for optimally selecting loop orders and blocking factors, applying loop reordering and blocking in isolation or in combination. DNNOPT uses the proposed Early-exit and Strided-search strategies to prune the search space, and simple analytical models of data reuse to evaluate each optimization point efficiently and accurately. Overall, DNNOPT reduces the search space by more than two orders of magnitude and improves performance, energy efficiency, and time to solution by, on average, 1.8x, 50%, and 226x, respectively, for convolutional neural network (CNN) and Transformer applications compared to state-of-the-art frameworks.
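The abstract only outlines the approach at a high level. Below is a minimal, hypothetical sketch of the idea, assuming a single-level on-chip buffer and blocking only the output-channel loop of a convolution; the names (footprint_bytes, dram_traffic, pick_k_block, k_block, sram_bytes) are illustrative and are not DNNOPT's actual interface. It shows how a strided search over blocking factors can be pruned with an early exit once the modeled on-chip working set no longer fits, with a simple analytical data-reuse model scoring each remaining candidate.

```python
# Hypothetical sketch (not the paper's implementation): pick a blocking factor
# for the output-channel loop of a convolution using a strided search with an
# early exit, scored by a simple analytical model of on-chip footprint and
# off-chip (DRAM) traffic.

from dataclasses import dataclass


@dataclass
class ConvLayer:
    # Simplified convolution layer shape (batch size omitted).
    in_ch: int            # input channels (C)
    out_ch: int           # output channels (K)
    out_h: int            # output height (P)
    out_w: int            # output width (Q)
    kernel: int           # square kernel size (R = S)
    bytes_per_elem: int = 2  # e.g. 16-bit operands


def footprint_bytes(layer: ConvLayer, k_block: int) -> int:
    """On-chip working set when the output-channel loop is blocked by k_block:
    one block of weights, one block of partial outputs, and one input tile."""
    weights = k_block * layer.in_ch * layer.kernel * layer.kernel
    outputs = k_block * layer.out_h * layer.out_w
    inputs = layer.in_ch * (layer.out_h + layer.kernel - 1) * (layer.out_w + layer.kernel - 1)
    return (weights + outputs + inputs) * layer.bytes_per_elem


def dram_traffic(layer: ConvLayer, k_block: int) -> int:
    """Rough analytical reuse model: the input tile is re-read once per
    output-channel block; weights and outputs move on/off chip once."""
    blocks = -(-layer.out_ch // k_block)  # ceiling division
    inputs = blocks * layer.in_ch * (layer.out_h + layer.kernel - 1) * (layer.out_w + layer.kernel - 1)
    weights = layer.out_ch * layer.in_ch * layer.kernel * layer.kernel
    outputs = layer.out_ch * layer.out_h * layer.out_w
    return (inputs + weights + outputs) * layer.bytes_per_elem


def pick_k_block(layer: ConvLayer, sram_bytes: int, stride: int = 4) -> int:
    """Strided search over candidate blocking factors with an early exit:
    candidates are visited in increasing order and the search stops as soon
    as the on-chip footprint no longer fits the buffer."""
    best, best_traffic = 1, dram_traffic(layer, 1)
    for k_block in range(stride, layer.out_ch + 1, stride):
        if footprint_bytes(layer, k_block) > sram_bytes:
            break  # early exit: every larger block is also infeasible
        traffic = dram_traffic(layer, k_block)
        if traffic < best_traffic:
            best, best_traffic = k_block, traffic
    return best


if __name__ == "__main__":
    layer = ConvLayer(in_ch=64, out_ch=256, out_h=56, out_w=56, kernel=3)
    print(pick_k_block(layer, sram_bytes=512 * 1024))
```

The early exit is sound in this simplified model because the working set grows monotonically with the blocking factor, so once one candidate overflows the buffer every larger candidate does too; the stride trades search time for resolution, mirroring the pruning trade-off described in the abstract.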
Pages: 126-137
Page count: 12