Olympus: Reaching Memory-Optimality on DNN Processors

Cited by: 4
Authors
Cai, Xuyi [1 ,2 ]
Wang, Ying [3 ]
Tu, Kaijie [1 ]
Gao, Chengsi [1 ,2 ]
Zhang, Lei [1 ]
Affiliations
[1] Chinese Acad Sci, Res Ctr Ubiquitous Comp Syst, Inst Comp Technol, Beijing 100080, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing 100080, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Program processors; Processor scheduling; Scheduling; System-on-chip; Optimal scheduling; Computer architecture; Network architecture; DNN; memory; scheduling; processor;
DOI
10.1109/TC.2021.3112262
CLC Number
TP3 [Computing technology, computer technology];
Discipline Code
0812;
Abstract
In DNN processors, main memory accesses consume far more energy than arithmetic operations. Many memory-oriented network scheduling (MONS) techniques have therefore been introduced to exploit on-chip data-reuse opportunities and reduce memory accesses. However, deriving the theoretical lower bound of memory overhead for DNNs remains a significant challenge, and such a bound would also shed light on how to reach memory-level optimality through network scheduling. Prior work on MONS mainly focused on disparate optimization techniques or missed some of the data-reuse opportunities in diverse network models, so its results are likely to deviate from the true memory-optimality achievable on processors. This paper introduces Olympus, which comprehensively considers the entire memory-level DNN scheduling space, formally analyzes the true memory-optimality, and shows how to reach memory-optimal schedules for an arbitrary DNN running on a DNN processor. The key idea behind Olympus is to derive a true memory lower bound that accounts for both intra-layer and inter-layer reuse opportunities, which prior works have not explored simultaneously. Evaluation on state-of-the-art DNN processors of different architectures shows that Olympus guarantees the minimum off-chip memory access, reducing DRAM accesses by 12.3-85.6% and saving 7.4-70.3% of energy on the latest network models.
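The effect of the inter-layer reuse described above can be made concrete with a small back-of-the-envelope model. The sketch below is an illustrative assumption for exposition only (it is not the cost model, lower-bound derivation, or any API from the paper): it compares off-chip traffic for a toy two-layer convolution under a layer-by-layer schedule versus a fused schedule that keeps the intermediate feature map on chip, the kind of reuse opportunity a memory-optimal schedule would exploit.

```python
# Illustrative sketch only: a toy model of off-chip DRAM traffic for a
# two-layer CNN. Sizes, layer shapes, and 16-bit datatypes are assumptions
# chosen for exposition, not values taken from the paper.

def feature_bytes(h, w, c, bytes_per_elem=2):
    """Size of an H x W x C feature map in bytes (16-bit activations assumed)."""
    return h * w * c * bytes_per_elem

def weight_bytes(k, c_in, c_out, bytes_per_elem=2):
    """Size of a KxK convolution's weights in bytes."""
    return k * k * c_in * c_out * bytes_per_elem

# Toy two-layer network: 56x56x64 input -> conv3x3(64->64) -> conv3x3(64->128)
in_fm   = feature_bytes(56, 56, 64)
mid_fm  = feature_bytes(56, 56, 64)
out_fm  = feature_bytes(56, 56, 128)
weights = weight_bytes(3, 64, 64) + weight_bytes(3, 64, 128)

# Layer-by-layer schedule: the intermediate feature map is written to DRAM by
# layer 1 and read back by layer 2, so it is counted twice.
layer_by_layer = in_fm + weights + 2 * mid_fm + out_fm

# Fused schedule (inter-layer reuse): the intermediate feature map stays in
# on-chip buffers (assuming it fits), so its DRAM traffic disappears.
fused = in_fm + weights + out_fm

print(f"layer-by-layer DRAM traffic: {layer_by_layer / 1e6:.2f} MB")
print(f"fused DRAM traffic:          {fused / 1e6:.2f} MB")
```

Under these assumed shapes the fused schedule removes the round trip of the intermediate feature map; a schedule that also exploits intra-layer reuse (tiling weights and activations within each layer) would reduce the remaining terms further, which is the combined space the abstract refers to.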
Pages: 1939-1951
Page count: 13