Optimus: An Operator Fusion Framework for Deep Neural Networks

Cited by: 6
Authors
Cai, Xuyi [1 ,2 ]
Wang, Ying [3 ,4 ]
Zhang, Lei [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, Zhejiang Lab, Beijing, Peoples R China
[4] Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Neural network; embedded processor; memory; layer fusion;
DOI
10.1145/3520142
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Reducing the number of parameters and operations in current deep neural network (DNN) architectures for applications on embedded and IoT platforms has received increasing attention. In contrast, the intermediate feature maps of such lightweight neural networks keep growing and often exceed the on-chip memory, becoming the new bottleneck and introducing considerable power-consuming off-chip memory accesses. To reduce these feature-induced memory accesses, operator fusion has been proposed to parallelize the execution of multiple convolutional layers and has shown a significant reduction of off-chip memory accesses. However, how to fuse neural operators remains a challenging issue that heavily depends on both the neural network (NN) topology and the specific DNN accelerator configuration. In this work, we observe that prior operator fusion approaches fail to guarantee memory-level optimality because they search a constrained operator-fusion design space. Considering the complexity of NN topologies and the constrained resources of DNN accelerators, we develop a novel operator fusion framework, Optimus. Optimus includes an accurate memory cost model dedicated to the scheduler for evaluating candidate operator-fusion schemes, and a directed acyclic graph (DAG)-based operator fusion algorithm for both off-line and on-line workload deployment scenarios, which together generate high-efficiency operator-fusion solutions for arbitrary network models running on DNN accelerators. Experimental results show that Optimus reduces off-chip memory accesses by 17-75% and achieves 1.86x-3.66x energy efficiency on state-of-the-art DNN workloads compared with the baselines, bringing a significant power-efficiency boost to DNN accelerators with different architectures and dataflows.
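For intuition, the following is a minimal, self-contained sketch of the general idea the abstract describes: estimating the off-chip feature-map traffic of candidate fusion groups over an operator DAG and greedily growing a group while fusion keeps the estimated traffic down. All names, the cost formula, and the greedy policy here are illustrative assumptions, not Optimus's actual cost model or fusion algorithm.

# Illustrative sketch only: a toy memory cost model and a greedy fusion pass
# over a topologically ordered operator chain. Names and formulas are assumed.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    ofm_bytes: int                               # output feature-map size
    succs: list = field(default_factory=list)    # successor operator names

def dram_traffic(group, ops, on_chip_bytes):
    """Estimate off-chip traffic when `group` executes as one fused block:
    an intermediate feature map stays on-chip only if all of its consumers
    are inside the group and it fits in the on-chip buffer."""
    traffic = 0
    for name in group:
        op = ops[name]
        internal = bool(op.succs) and all(s in group for s in op.succs)
        if not (internal and op.ofm_bytes <= on_chip_bytes):
            traffic += op.ofm_bytes              # spilled to off-chip memory
    return traffic

def greedy_fuse(order, ops, on_chip_bytes):
    """Grow fusion groups along a topological order while fusing the next
    operator does not increase the estimated off-chip traffic."""
    groups, current = [], []
    for name in order:
        trial = current + [name]
        unfused_cost = (dram_traffic(current, ops, on_chip_bytes)
                        + ops[name].ofm_bytes)
        if not current or dram_traffic(trial, ops, on_chip_bytes) <= unfused_cost:
            current = trial
        else:
            groups.append(current)
            current = [name]
    if current:
        groups.append(current)
    return groups

# Toy three-layer chain: conv1 -> conv2 -> conv3, 1 MB on-chip buffer.
ops = {
    "conv1": Op("conv1", 512 * 1024, ["conv2"]),
    "conv2": Op("conv2", 256 * 1024, ["conv3"]),
    "conv3": Op("conv3", 128 * 1024, []),
}
print(greedy_fuse(["conv1", "conv2", "conv3"], ops, 1024 * 1024))
# -> [['conv1', 'conv2', 'conv3']]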
Pages: 26