Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

Cited by: 644
Authors
Deng, Lei [1 ,2 ]
Li, Guoqi [1 ,3 ]
Han, Song [4 ]
Shi, Luping [1 ,3 ]
Xie, Yuan [2 ]
Affiliations
[1] Tsinghua Univ, Ctr Brain Inspired Comp Res, Dept Precis Instrument, Beijing 100084, Peoples R China
[2] Univ Calif Santa Barbara, Dept Elect & Comp Engn, Santa Barbara, CA 93106 USA
[3] Tsinghua Univ, Beijing Innovat Ctr Future Chip, Beijing 100084, Peoples R China
[4] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA
Funding
U.S. National Science Foundation;
Keywords
Neural networks; Tensor decomposition; Data quantization; Acceleration; Program processors; Machine learning; Task analysis; Compact neural network; data quantization; neural network acceleration; neural network compression; sparse neural network; tensor decomposition; SINGULAR-VALUE DECOMPOSITION; TENSOR DECOMPOSITIONS; MEMORY; TRAIN; COMPUTATION; ENERGY; ARCHITECTURES; PREDICTION; ACCURACY; CNN;
DOI
10.1109/JPROC.2020.2976475
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Domain-specific hardware has become a promising topic against the backdrop of slowing improvement in general-purpose processors due to the foreseeable end of Moore's law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain, with successful applications across a wide spectrum of artificial intelligence (AI) tasks. The unrivaled accuracy of DNNs comes at the cost of heavy memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. The concept of DNN compression was therefore proposed and is now widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up in pursuit of a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators aimed at extremely high performance. However, the body of related work is enormous and the reported approaches are quite divergent. This fragmented landscape motivates us to provide a comprehensive survey of recent advances toward efficient compression and execution of DNNs without significantly compromising accuracy, covering both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches, such as compact models, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. Finally, we discuss several open issues, such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and we identify promising topics in this field along with the remaining challenges. This article aims to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate the various methods, and confidently get started in the right way.
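As a quick illustration of three of the compression families named in the abstract (data quantization, network sparsification, and low-rank tensor decomposition), the following is a minimal NumPy sketch applied to a single weight matrix. It is not code from the surveyed paper; the function names (quantize_uniform, prune_by_magnitude, truncated_svd) and the specific choices (symmetric 8-bit quantization, magnitude pruning, truncated SVD) are illustrative assumptions only.

import numpy as np


def quantize_uniform(w, num_bits=8):
    # Data quantization (illustrative): symmetric uniform mapping of float
    # weights to signed integers; dequantize with q * scale.
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def prune_by_magnitude(w, sparsity=0.9):
    # Network sparsification (illustrative): zero out the smallest-magnitude
    # weights until roughly `sparsity` of the entries are zero.
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)


def truncated_svd(w, rank):
    # Low-rank decomposition (illustrative): w (m x n) ~ a (m x r) @ b (r x n),
    # reducing storage and multiply count when r << min(m, n).
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)

    q, scale = quantize_uniform(w, num_bits=8)
    w_sparse = prune_by_magnitude(w, sparsity=0.9)
    a, b = truncated_svd(w, rank=32)

    print("max quantization error :", np.max(np.abs(w - q.astype(np.float32) * scale)))
    print("achieved sparsity      :", 1.0 - np.count_nonzero(w_sparse) / w.size)
    print("relative low-rank error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))

In practice, as the survey discusses, such transforms are typically combined with retraining or fine-tuning to recover accuracy, and with matching hardware support (sparse or low-precision arithmetic) to turn the nominal compression into real speedups.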
Pages: 485-532
Number of pages: 48