Optimizing Depthwise Separable Convolution Operations on GPUs

Cited by: 35
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150000, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
National Natural Science Foundation of China
Keywords
Convolution; Graphics processing units; Instruction sets; Kernel; Standards; Training; Registers; Performance optimization; convolution; depthwise; pointwise; memory optimization; GPU utilization
DOI
10.1109/TPDS.2021.3084813
CLC number
TP301 [Theory and Methods]
Discipline code
081202
Abstract
The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs) and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch-size model training and for the typical model-inference scenario, where the model takes in only a few samples at a time. This article aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve the column and row reuse of the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computational data across GPU threads, improving GPU utilization and hiding memory access latency. We apply our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and on two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2x (up to 3x) performance improvement over cuDNN. When using a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by an average of 9.7 and 7.3 percent, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
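To make the row-reuse idea concrete, below is a minimal CUDA sketch of a depthwise 3x3 convolution kernel (stride 1, zero padding, batch size 1) in which each thread computes TILE_H vertically adjacent outputs and keeps a sliding three-row window in registers, so each loaded input row is reused by up to three outputs. The kernel name, tile size, and launch configuration are illustrative assumptions, not the authors' implementation; the paper's actual algorithms also exploit column reuse and choose tile sizes dynamically.

```cuda
// Illustrative sketch (not the paper's code): depthwise 3x3 convolution with
// register-level row reuse. Each thread produces TILE_H vertically adjacent
// outputs, so each input row it loads is reused by up to 3 outputs instead
// of being re-fetched from memory.
#define TILE_H 4  // outputs per thread along the height dimension (assumed)

__global__ void depthwise_conv3x3(const float* __restrict__ in,    // [C][H][W]
                                  const float* __restrict__ filt,  // [C][3][3]
                                  float* __restrict__ out,         // [C][H][W]
                                  int C, int H, int W) {
    int c  = blockIdx.z;                                        // depthwise: one filter per channel
    int x  = blockIdx.x * blockDim.x + threadIdx.x;             // output column
    int y0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE_H;  // first output row
    if (c >= C || x >= W || y0 >= H) return;

    const float* ip = in   + (size_t)c * H * W;
    const float* fp = filt + (size_t)c * 9;
    float*       op = out  + (size_t)c * H * W;

    // Load a 3-wide sliver of one input row into registers (zero padded).
    auto load_row = [&](int y, float r[3]) {
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = x + dx;
            r[dx + 1] = (y >= 0 && y < H && xx >= 0 && xx < W)
                        ? ip[(size_t)y * W + xx] : 0.0f;
        }
    };

    float r0[3], r1[3], r2[3];
    load_row(y0 - 1, r0);
    load_row(y0,     r1);

    for (int t = 0; t < TILE_H && y0 + t < H; ++t) {
        load_row(y0 + t + 1, r2);  // only one new row per output: row reuse
        float acc = 0.0f;
        for (int k = 0; k < 3; ++k)
            acc += fp[k] * r0[k] + fp[3 + k] * r1[k] + fp[6 + k] * r2[k];
        op[(size_t)(y0 + t) * W + x] = acc;
        // Slide the three-row register window down by one row.
        for (int k = 0; k < 3; ++k) { r0[k] = r1[k]; r1[k] = r2[k]; }
    }
}

// Launch sketch: dim3 block(32, 4);
//                dim3 grid((W + block.x - 1) / block.x,
//                          (H + block.y * TILE_H - 1) / (block.y * TILE_H), C);
//                depthwise_conv3x3<<<grid, block>>>(in, filt, out, C, H, W);
```

With this scheme each new output row costs three fresh scalar loads instead of nine; a complete depthwise separable convolution would follow this stage with a 1x1 pointwise convolution, either as a separate kernel or a fused epilogue.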
Pages: 70 - 87 (18 pages)