Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity

Cited by: 7
Authors
Dong, Xiao [1 ,2 ]
Liu, Lei [1 ]
Zhao, Peng [1 ,2 ]
Li, Guangli [1 ,2 ]
Li, Jiansong [1 ,2 ]
Wang, Xueying [1 ,2 ]
Feng, Xiaobing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, State Key Lab Comp Architecture, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Source
2019 28TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2019) | 2019
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Deep Learning; Sparse; Optimization; Compiler; MATRIX-VECTOR MULTIPLICATION; SPMV;
DOI
10.1109/PACT.2019.00022
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Deep neural networks have been employed in a broad range of applications, including face detection, natural language processing, and autonomous driving. Yet neural networks capable of tackling real-world problems are intrinsically expensive to compute, which hinders their deployment. Sparsity in the input data of neural networks offers an optimization opportunity. However, harnessing the potential performance gains on modern CPUs faces challenges raised by the sparse computations of neural networks, such as cache-unfriendly memory accesses and the difficulty of implementing efficient sparse kernels. In this paper, we propose Acorns, a framework to accelerate deep neural networks with input sparsity. In Acorns, sparse input data is organized into our designed sparse data layout, which allows memory-friendly access for kernels in neural networks and opens the door to many performance-critical optimizations. Building on this layout, Acorns generates efficient sparse kernels for neural network operators from kernel templates, which combine directives expressing the optimizing transformations to be performed with straightforward code describing the computation. Comprehensive evaluations demonstrate that Acorns outperforms state-of-the-art baselines by significant margins. On a real-world detection task in autonomous driving, Acorns delivers 1.8-22.6x performance improvements over the baselines: the generated programs achieve 1.8-2.4x speedups over Intel MKL-DNN, 3.0-8.8x over TensorFlow, and 11.1-13.2x over Intel MKL-Sparse.
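The abstract describes the two core ideas, a sparsity-aware input layout and template-generated kernels, only at a high level. As a minimal illustration of the first idea, the C sketch below stores a sparse feature map in a CSR-style row-pointer/column-index layout and runs a 1x1 convolution that visits only the stored nonzeros. The layout and the names SparseInput and spconv1x1 are assumptions made for illustration, not Acorns' actual design.

/* Minimal sketch of exploiting input sparsity in a 1x1 convolution.
 * Assumptions (not from the paper): a CSR-style layout with row
 * pointers and column indices over the spatial grid; the names
 * SparseInput and spconv1x1 are hypothetical. */
#include <stdio.h>

#define H 2            /* spatial rows of the input feature map   */
#define W 4            /* spatial columns                          */
#define C_IN 1         /* input channels (1 to keep the demo tiny) */
#define C_OUT 2        /* output channels                          */

/* CSR-style layout over the H x W grid: only nonzero pixels are
 * stored, so the kernel's inner loop touches contiguous memory. */
typedef struct {
    int   rowptr[H + 1]; /* rowptr[i]..rowptr[i+1] index row i's nonzeros */
    int   col[H * W];    /* column of each stored nonzero                 */
    float val[H * W];    /* value of each stored nonzero (C_IN == 1)      */
} SparseInput;

/* A 1x1 convolution is a per-pixel matvec with the weight matrix;
 * zero pixels contribute nothing, so they are simply never visited. */
static void spconv1x1(const SparseInput *in,
                      const float weight[C_OUT][C_IN],
                      float out[C_OUT][H][W]) {
    for (int i = 0; i < H; ++i)
        for (int k = in->rowptr[i]; k < in->rowptr[i + 1]; ++k)
            for (int oc = 0; oc < C_OUT; ++oc)
                out[oc][i][in->col[k]] += weight[oc][0] * in->val[k];
}

int main(void) {
    /* Dense view:  row 0 = [5 0 0 3],  row 1 = [0 0 2 0] */
    SparseInput in = { {0, 2, 3}, {0, 3, 2}, {5.0f, 3.0f, 2.0f} };
    float weight[C_OUT][C_IN] = { {1.0f}, {-2.0f} };
    float out[C_OUT][H][W] = {0};

    spconv1x1(&in, weight, out);
    printf("out[1][0][3] = %.1f\n", out[1][0][3]); /* -2 * 3 = -6.0 */
    return 0;
}

Because each row's nonzeros sit contiguously in the col and val arrays, the inner loop streams through memory sequentially; this is the kind of cache-friendly access pattern the abstract attributes to the designed sparse layout.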
Pages: 178-191
Page count: 14