Optimizing GPU Memory Transactions for Convolution Operations

Cited by: 5
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Comp Sci & Technol, Harbin, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds, W Yorkshire, England
Source
2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020) | 2020
Keywords
Performance Optimization; Convolution; Memory Optimization; GPUs;
DOI
10.1109/CLUSTER49012.2020.00050
CLC Number
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Convolution is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inference. Existing approaches to accelerating convolution aim to reduce computational complexity. However, these strategies often increase the memory footprint with extra memory accesses, leaving much room for performance improvement. This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution. Our approach leverages two optimization techniques to reduce the number of memory operations performed on the width and height dimensions of the convolution. For convolution computations on the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input between threads, reducing the number of memory transactions. For convolution operations on the height dimension, we multiply each overlapped row of the input with multiple rows of a filter to compute multiple output elements, improving the data locality of row elements. We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over 2x faster performance than state-of-the-art image processing libraries. For multi-channel 2D convolution, we obtain up to 1.3x speedups over the fastest algorithm in cuDNN.
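To make the two techniques described in the abstract concrete, the following is a minimal CUDA sketch (not the paper's implementation) of a single-channel 2D convolution combining both ideas: each input element is read from global memory once per warp, the overlapped columns are exchanged between lanes with __shfl_down_sync (the width-dimension optimization), and each loaded row is multiplied with all R filter rows while T partial output rows are kept in registers (the height-dimension optimization). The filter size R = C = 3, the per-thread row tile T = 4, the kernel name, and the launch configuration are all illustrative assumptions.

// conv2d_sketch.cu -- illustrative sketch only, not the authors' code.
#include <cstdio>
#include <cuda_runtime.h>

#define R 3      // filter height (assumed for illustration)
#define C 3      // filter width  (assumed; must be <= 32)
#define T 4      // output rows accumulated per thread (assumed tile size)
#define WARP 32

__constant__ float d_f[R][C];   // filter held in constant memory

__global__ void conv2dShuffleReuse(const float *__restrict__ in,
                                   float *__restrict__ out,
                                   int w, int h) {
    const int ow = w - C + 1, oh = h - R + 1;   // "valid" output size
    const int lanesOut = WARP - C + 1;          // output columns per warp

    int lane   = threadIdx.x % WARP;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int warpsPerStrip = (ow + lanesOut - 1) / lanesOut;
    int colBase = (warpId % warpsPerStrip) * lanesOut;
    int rowBase = (warpId / warpsPerStrip) * T;
    if (rowBase >= oh) return;                  // warp-uniform exit

    int x = colBase + lane;                     // the one column this lane loads
    float acc[T];                               // T partial output rows in registers
    #pragma unroll
    for (int t = 0; t < T; ++t) acc[t] = 0.0f;

    // Each input row is read from global memory exactly once, then reused by
    // all R filter rows (height-dimension reuse) and shared with up to C - 1
    // neighbouring lanes via register shuffles (width-dimension exchange).
    for (int y = 0; y < T + R - 1; ++y) {
        int iy = rowBase + y;
        float v = (iy < h && x < w) ? in[iy * w + x] : 0.0f;  // single load

        float nb[C];                            // overlapped columns, no reload
        #pragma unroll
        for (int c = 0; c < C; ++c)
            nb[c] = __shfl_down_sync(0xffffffffu, v, c);

        // Input row y contributes to output rows y - R + 1 .. y of this tile.
        #pragma unroll
        for (int r = 0; r < R; ++r) {
            int oy = y - r;
            if (oy >= 0 && oy < T) {
                float hs = 0.0f;
                #pragma unroll
                for (int c = 0; c < C; ++c) hs += nb[c] * d_f[r][c];
                acc[oy] += hs;
            }
        }
    }

    // Lanes whose shuffle window would cross the warp boundary write nothing.
    if (lane < lanesOut && x < ow)
        for (int t = 0; t < T && rowBase + t < oh; ++t)
            out[(rowBase + t) * ow + x] = acc[t];
}

int main() {
    const int w = 1024, h = 1024;
    const int ow = w - C + 1, oh = h - R + 1;
    float hf[R][C];
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c) hf[r][c] = 1.0f / (R * C);  // box filter

    float *din, *dout;
    cudaMalloc((void **)&din,  w * h * sizeof(float));
    cudaMalloc((void **)&dout, (size_t)ow * oh * sizeof(float));
    cudaMemset(din, 0, w * h * sizeof(float));
    cudaMemcpyToSymbol(d_f, hf, sizeof(hf));

    int lanesOut = WARP - C + 1;
    int warps   = ((ow + lanesOut - 1) / lanesOut) * ((oh + T - 1) / T);
    int threads = 256;                          // multiple of the warp size
    int blocks  = (warps * WARP + threads - 1) / threads;
    conv2dShuffleReuse<<<blocks, threads>>>(din, dout, w, h);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(din); cudaFree(dout);
    return 0;
}

Under these assumptions, an element fetched by one lane contributes to up to C output columns and R output rows, so global-memory transactions fall roughly by a factor of R x C compared with a naive kernel that reloads every operand.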
Pages: 399-403
Page count: 5
相关论文
共 16 条
  • [1] [Anonymous], 2014, ARXIV14091556
  • [2] [Anonymous], 2015, ArrayFire - A high performance software library for parallel computing with an easy-to-use API
  • [3] Chellapilla K., 2006, P 10 INT WORKSHOP FR
  • [4] Chetlur S., 2014, CUDNN EFFICIENT PRIM
  • [5] Cho M, 2017, PR MACH LEARN RES, V70
  • [6] Deep Residual Learning for Image Recognition
    He, Kaiming
    Zhang, Xiangyu
    Ren, Shaoqing
    Sun, Jian
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 770 - 778
  • [7] Iandola FN, 2013, IEEE IMAGE PROC, P2116, DOI 10.1109/ICIP.2013.6738436
  • [8] Caffe: Convolutional Architecture for Fast Feature Embedding
    Jia, Yangqing
    Shelhamer, Evan
    Donahue, Jeff
    Karayev, Sergey
    Long, Jonathan
    Girshick, Ross
    Guadarrama, Sergio
    Darrell, Trevor
    [J]. PROCEEDINGS OF THE 2014 ACM CONFERENCE ON MULTIMEDIA (MM'14), 2014, : 675 - 678
  • [9] ImageNet Classification with Deep Convolutional Neural Networks
    Krizhevsky, Alex
    Sutskever, Ilya
    Hinton, Geoffrey E.
    [J]. COMMUNICATIONS OF THE ACM, 2017, 60 (06) : 84 - 90
  • [10] Fast Algorithms for Convolutional Neural Networks
    Lavin, Andrew
    Gray, Scott
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 4013 - 4021