Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Cited by: 238
Authors
Yu, Jiecao [1 ]
Lukefahr, Andrew [1 ]
Palframan, David [2 ]
Dasika, Ganesh [2 ]
Das, Reetuparna [1 ]
Mahlke, Scott [1 ]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] ARM, Cambridge, England
Source
44TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2017) | 2017
Funding
U.S. National Science Foundation (NSF);
Keywords
neural network pruning; hardware parallelism; single instruction, multiple data (SIMD)
DOI
10.1145/3079856.3080215
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As the size of Deep Neural Networks (DNNs) continues to grow to increase accuracy and solve more complex problems, their energy footprint also scales. Weight pruning reduces DNN model size and computation by removing redundant weights. However, we implemented weight pruning for several popular networks on a variety of hardware platforms and observed surprising results. For many networks, the sparsity caused by weight pruning actually hurts overall performance despite large reductions in model size and required multiply-accumulate operations. In addition, encoding the sparse format of pruned networks incurs extra storage overhead. To overcome these challenges, we propose Scalpel, which customizes DNN pruning to the underlying hardware by matching the pruned network structure to the data-parallel hardware organization. Scalpel consists of two techniques: SIMD-aware weight pruning and node pruning. For low-parallelism hardware (e.g., microcontrollers), SIMD-aware weight pruning maintains weights in aligned, fixed-size groups to fully utilize the SIMD units. For high-parallelism hardware (e.g., GPUs), node pruning removes redundant nodes rather than redundant weights, reducing computation without sacrificing the dense matrix format. For hardware with moderate parallelism (e.g., desktop CPUs), SIMD-aware weight pruning and node pruning are applied together synergistically. Across the microcontroller, CPU, and GPU, Scalpel achieves mean speedups of 3.54x, 2.61x, and 1.25x while reducing model sizes by 88%, 82%, and 53%. In comparison, traditional weight pruning achieves mean speedups of 1.90x, 1.06x, and 0.41x across the three platforms.
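To make the grouping idea concrete, the following is a minimal NumPy sketch of SIMD-aware weight pruning as described in the abstract: weights are removed in aligned, fixed-size groups matching the SIMD width rather than individually, so surviving weights still map onto full SIMD lanes. The group size, threshold, and RMS-magnitude criterion are illustrative assumptions, not the paper's exact procedure (the paper additionally retrains the pruned network and stores surviving groups in a compact format).

    import numpy as np

    def simd_aware_prune(weights, group_size=4, threshold=0.15):
        """Zero out aligned, fixed-size groups of weights whose RMS magnitude
        falls below `threshold`. `group_size` would match the SIMD width of
        the target hardware (e.g., 4 lanes). Illustrative sketch only."""
        w = weights.copy()
        rows, cols = w.shape
        assert cols % group_size == 0, "pad columns to a multiple of the SIMD width"
        groups = w.reshape(rows, cols // group_size, group_size)
        group_rms = np.sqrt((groups ** 2).mean(axis=2))      # per-group magnitude
        keep = (group_rms >= threshold)[:, :, None]          # keep strong groups
        return (groups * keep).reshape(rows, cols)

    # Example: prune a small fully-connected layer for a 4-wide SIMD unit.
    np.random.seed(0)
    layer = 0.2 * np.random.randn(8, 16)
    pruned = simd_aware_prune(layer, group_size=4, threshold=0.15)
    print("weights removed: %.0f%%" % (100 * (pruned == 0).mean()))

Because entire groups are either kept or dropped, the remaining nonzeros can be packed contiguously and consumed by SIMD loads without per-element index decoding, which is the effect the abstract attributes to SIMD-aware pruning on low-parallelism hardware.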
Pages: 548 - 560
Page count: 13