Optimizing Depthwise Separable Convolution Operations on GPUs

Cited: 35
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150000, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Graphics processing units; Instruction sets; Kernel; Standards; Training; Registers; Performance optimization; convolution; depthwise; pointwise; memory optimization; GPU utilization;
DOI
10.1109/TPDS.2021.3084813
CLC Classification Number
TP301 [Theory and Methods];
Subject Classification Code
081202;
Abstract
The depthwise separable convolution is commonly seen in convolutional neural networks (CNNs) and is widely used to reduce the computation overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where a large number of samples are processed at once. Such approaches are inadequate for small-batch training and for the typical model-inference scenario, where the model takes in only a few samples at a time. This article aims to bridge this gap by optimizing depthwise separable convolutions for the GPU architecture. We achieve this by designing two novel algorithms that improve column and row reuse in the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile-size scheme that adaptively distributes the computation across GPU threads to improve GPU utilization and hide memory access latency. We evaluate our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and on two data types, 32-bit floating point (FP32) and 8-bit integer (INT8). We compare our approach against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over a 2x (up to 3x) performance improvement over cuDNN. We show that, with a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
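As a concrete illustration of the row-reuse idea described in the abstract, the CUDA sketch below has each thread compute several vertically adjacent outputs of a depthwise convolution, so that input rows loaded into registers are shared across output rows instead of being re-fetched from global memory. The kernel and its name, the 3x3 filter size, the pre-padded NCHW layout, and the fixed rows-per-thread count are all illustrative assumptions, not the paper's implementation (which also adds column reuse and picks tile sizes dynamically).

// Minimal sketch of register-level row reuse for a depthwise convolution
// (3x3 filter, stride 1, pre-padded NCHW input). Hypothetical kernel,
// not the authors' code.
#include <cuda_runtime.h>

#define KH 3                // filter height (assumed)
#define KW 3                // filter width (assumed)
#define ROWS_PER_THREAD 4   // output rows per thread (assumed, fixed tile size)

// Each thread produces ROWS_PER_THREAD vertically adjacent outputs for one
// (channel, column) pair. Adjacent output rows share KH-1 input rows, so each
// input value is loaded from global memory once and reused from registers.
__global__ void depthwise_row_reuse(const float* __restrict__ in,
                                    const float* __restrict__ filt,
                                    float* __restrict__ out,
                                    int C, int H, int W,   // padded input dims
                                    int outH, int outW) {
    int c   = blockIdx.z;                                  // channel
    int ox  = blockIdx.x * blockDim.x + threadIdx.x;       // output column
    int oy0 = (blockIdx.y * blockDim.y + threadIdx.y) * ROWS_PER_THREAD;
    if (c >= C || ox >= outW || oy0 >= outH) return;

    // Keep this channel's 3x3 filter in registers.
    float w[KH][KW];
    for (int i = 0; i < KH; ++i)
        for (int j = 0; j < KW; ++j)
            w[i][j] = filt[(c * KH + i) * KW + j];

    float acc[ROWS_PER_THREAD] = {0.f};

    // Sweep the input rows once; each loaded row feeds up to KH of the
    // ROWS_PER_THREAD accumulators -- this is the row-reuse step.
    for (int iy = oy0; iy < oy0 + ROWS_PER_THREAD + KH - 1 && iy < H; ++iy) {
        float row[KW];                          // KW neighbours, loaded once
        for (int j = 0; j < KW; ++j)
            row[j] = in[(c * H + iy) * W + (ox + j)];
        for (int r = 0; r < ROWS_PER_THREAD; ++r) {
            int fi = iy - (oy0 + r);            // filter row hit by this input row
            if (fi >= 0 && fi < KH)
                for (int j = 0; j < KW; ++j)
                    acc[r] += w[fi][j] * row[j];
        }
    }
    for (int r = 0; r < ROWS_PER_THREAD; ++r)
        if (oy0 + r < outH)
            out[(c * outH + (oy0 + r)) * outW + ox] = acc[r];
}

A launch such as depthwise_row_reuse<<<dim3((outW+31)/32, (outH+8*ROWS_PER_THREAD-1)/(8*ROWS_PER_THREAD), C), dim3(32, 8)>>>(...) would cover the output. Compared with a naive kernel that re-reads each interior input element KH*KW times, this sketch removes the KH-fold redundancy along the height dimension; the overlapping reads that remain along the width dimension are what the paper's column-reuse algorithm targets, and the paper tunes the tile size dynamically rather than fixing ROWS_PER_THREAD.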
Pages: 70 - 87
Number of Pages: 18
Related Papers
50 records in total
  • [41] Novel Dilated Separable Convolution Networks for Efficient Video Salient Object Detection in the Wild
    Singh, Hemraj
    Verma, Mridula
    Cheruku, Ramalingaswamy
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2023, 72
  • [42] S-EEGNet: Electroencephalogram Signal Classification Based on a Separable Convolution Neural Network With Bilinear Interpolation
    Huang, Wenkai
    Xue, Yihao
    Hu, Lingkai
    Liuli, Hantang
    IEEE ACCESS, 2020, 8 : 131636 - 131646
  • [43] Lightweight Bridge Crack Detection Method Based on SegNet and Bottleneck Depth-Separable Convolution With Residuals
    Zheng, Xuan
    Zhang, Shuailong
    Li, Xue
    Li, Gang
    Li, Xiyuan
    IEEE ACCESS, 2021, 9 : 161649 - 161668
  • [44] Lw-PSCNN: Lightweight Pointwise-Separable Convolution Neural Network for ISAR Image Classification
    Palguna, K. R. Gopireddy
    Kumar, G. Arun
    Ram, Gopi
    Hashmi, Mohammad Farukh
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2025, 74
  • [45] Optimizing 2D Convolution for DCUs
    Wenlong Fan
    Haobo Hua
    Jiandong Shang
    Zhuxin Wen
    Hengliang Guo
    Litao Zhang
    CCF Transactions on High Performance Computing, 2025, 7 (2) : 142 - 154
  • [46] FSCNN: Fuzzy Channel Filter-Based Separable Convolution Neural Networks for Medical Imaging Recognition
    Huang, Hao
    Oh, Sung-Kwun
    Fu, Zunwei
    Wu, Chuan-Kun
    Pedrycz, Witold
    Kim, Jin-Yul
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2024, 32 (10) : 5449 - 5461
  • [47] Optimizing Half Precision Winograd Convolution on ARM Many-Core Processors
    Xie, Dedong
    Jia, Zhen
    Zhang, Zili
    Jin, Xin
    PROCEEDINGS OF THE 13TH ACM SIGOPS ASIA-PACIFIC WORKSHOP ON SYSTEMS, APSYS 2022, 2022, : 53 - 60
  • [48] Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs
    Jia, Zhen
    Zlateski, Aleksandar
    Durand, Fredo
    Li, Kai
    ACM SIGPLAN NOTICES, 2018, 53 (01) : 109 - 123
  • [49] Flexible hardware-friendly digital architecture for 2-D separable-convolution-based scaling
    Cardells-Tormo, Francisco
    Arnabat-Benedicto, Jordi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2006, 53 (07) : 522 - 526
  • [50] Accelerating Clifford Algebra Operations using GPUs and an OpenCL Code Generator
    Franchini, Silvia
    Gentile, Antonio
    Vassallo, Giorgio
    Vitabile, Salvatore
    2015 EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD), 2015, : 57 - 64