Gist: Efficient Data Encoding for Deep Neural Network Training

Cited by: 94
Authors
Jain, Animesh [1 ,2 ]
Phanishayee, Amar [1 ]
Mars, Jason [2 ]
Tang, Lingjia [2 ]
Pekhimenko, Gennady [3 ]
Affiliations
[1] Microsoft Res, Redmond, WA USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
[3] Univ Toronto, Toronto, ON, Canada
Source
2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA) | 2018
Funding
U.S. National Science Foundation;
Keywords
DNN Training; Data Encodings; Compression;
DOI
10.1109/ISCA.2018.00070
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline Classification Code
0812;
Abstract
Modern deep neural network (DNN) training typically relies on GPUs to train complex hundred-layer-deep networks. A significant problem facing both researchers and industry practitioners is that, as the networks get deeper, the available GPU main memory becomes a primary bottleneck, limiting the size of networks they can train. In this paper, we investigate widely used DNNs and find that the major contributors to memory footprint are intermediate layer outputs (feature maps). We then introduce a framework of DNN-layer-specific optimizations (e.g., convolution, ReLU, pool) that significantly reduce this source of main memory pressure on GPUs. We find that a feature map typically has two uses that are spread far apart temporally. Our key approach is to store an encoded representation of feature maps for this temporal gap and decode this data for use in the backward pass; the full-fidelity feature maps are used in the forward pass and relinquished immediately. Based on this approach, we present Gist, our system that employs two classes of layer-specific encoding schemes, lossless and lossy, to exploit existing value redundancy in DNN training and significantly reduce the memory consumption of targeted feature maps. For example, one insight is that, by taking advantage of the computational nature of backpropagation from the pool layer to the ReLU layer, we can store the intermediate feature map using just 1 bit instead of 32 bits per value. We deploy these mechanisms in a state-of-the-art DNN framework (CNTK) and observe that Gist reduces the memory footprint by up to 2x across 5 state-of-the-art image classification DNNs, with an average of 1.8x and only 4% performance overhead. We also show that further software (e.g., cuDNN) and hardware (e.g., dynamic allocation) optimizations can result in even larger footprint reductions (up to 4.1x).
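The 1-bit encoding mentioned in the abstract can be illustrated with a small sketch. The NumPy code below is only an illustration of the idea for a ReLU layer feeding a pooling layer, not the authors' CNTK implementation; function names such as relu_forward_encode and relu_backward_decode are hypothetical.

    # Sketch of the "1 bit instead of 32 bits" idea: ReLU's backward pass only
    # needs to know which activations were positive, so the feature map stashed
    # between the forward and backward passes can be replaced by a packed bitmask.
    import numpy as np

    def relu_forward_encode(x):
        """Forward ReLU; stash only a packed bitmask of positive entries."""
        y = np.maximum(x, 0.0)          # full-fidelity output, consumed by the next layer
        mask_bits = np.packbits(x > 0)  # ~1 bit per value kept for the backward pass
        return y, (mask_bits, x.shape)

    def relu_backward_decode(grad_y, stash):
        """Backward ReLU using the 1-bit mask instead of the 32-bit feature map."""
        mask_bits, shape = stash
        mask = np.unpackbits(mask_bits, count=int(np.prod(shape))).reshape(shape)
        return grad_y * mask            # dL/dx = dL/dy where x > 0, else 0

    # Toy usage: a 32-bit float feature map is held as ~1 bit/value in the temporal gap.
    x = np.random.randn(64, 128).astype(np.float32)
    y, stash = relu_forward_encode(x)
    grad_x = relu_backward_decode(np.ones_like(y), stash)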
Pages: 776-789
Page count: 14