Deep neural networks (DNNs) have shown extraordinary performance in recent years in applications such as image classification, object detection, speech recognition, and natural language processing. Accuracy-driven DNN architectures tend to grow rapidly in model size and computation, demanding a massive amount of hardware resources. Frequent communication between the processing engine and the on-/off-chip memory leads to high energy consumption, which becomes a bottleneck in conventional DNN accelerator designs.