Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Cited by: 21
Authors
Hawks, Benjamin [1]
Duarte, Javier [2]
Fraser, Nicholas J. [3]
Pappalardo, Alessandro [3]
Tran, Nhan [1,4]
Umuroglu, Yaman [3]
Affiliations
[1] Fermilab Natl Accelerator Lab, POB 500, Batavia, IL 60510 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Xilinx Res, Dublin, Ireland
[4] Northwestern Univ, Evanston, IL USA
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE | 2021, Vol. 4
Funding
U.S. Department of Energy
Keywords
pruning; quantization; neural networks; generalizability; regularization; batch normalization; model compression; acceleration
DOI
10.3389/frai.2021.676564
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques such as regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs as well as or better than other neural architecture search techniques, such as Bayesian optimization, in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance on the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
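The abstract describes interleaving magnitude-based pruning with quantization-aware training. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: the straight-through-estimator fake quantizer, the 16-64-32-5 layer sizes, the 6-bit width, the per-round pruning fraction, and the random training data are all assumptions chosen for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune


class FakeQuant(torch.autograd.Function):
    # Symmetric uniform fake quantization with a straight-through estimator (STE).
    @staticmethod
    def forward(ctx, x, bits):
        levels = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / levels
        return torch.round(x / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: pass gradients through the rounding unchanged.
        return grad_output, None


class QuantLinear(nn.Linear):
    # nn.Linear whose weights are fake-quantized on every forward pass, so the
    # network is trained against its quantized weights (quantization-aware training).
    def __init__(self, in_features, out_features, bits=6):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, self.bits)
        return F.linear(x, w_q, self.bias)


# Hypothetical 16 -> 64 -> 32 -> 5 fully connected classifier; substitute the
# real architecture and data loader for an actual study.
model = nn.Sequential(
    QuantLinear(16, 64), nn.ReLU(),
    QuantLinear(64, 32), nn.ReLU(),
    QuantLinear(32, 5),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Quantization-aware pruning: alternate rounds of quantization-aware fine-tuning
# with magnitude pruning of a further fraction of the remaining weights.
for pruning_round in range(4):
    for _ in range(200):  # placeholder loop over random batches
        x = torch.randn(128, 16)
        y = torch.randint(0, 5, (128,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.2)

# Make the pruning masks permanent before exporting the model.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

In practice this loop would be driven by the real training set and a hardware-oriented quantization library rather than the toy fake quantizer above; the sketch only shows how pruning masks and fake-quantized weights coexist during training.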
Pages: 15