Optimizing Depthwise Separable Convolution Operations on GPUs

Cited by: 35
Authors
Lu, Gangzhao [1 ]
Zhang, Weizhe [1 ]
Wang, Zheng [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150000, Peoples R China
[2] Univ Leeds, Sch Comp, Leeds LS2 9JT, W Yorkshire, England
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Graphics processing units; Instruction sets; Kernel; Standards; Training; Registers; Performance optimization; convolution; depthwise; pointwise; memory optimization; GPU utilization;
DOI
10.1109/TPDS.2021.3084813
CLC classification
TP301 [Theory, Methods];
Discipline code
081202;
Abstract
The depthwise separable convolution is common in convolutional neural networks (CNNs), where it is widely used to reduce the computational overhead of a standard multi-channel 2D convolution. Existing implementations of depthwise separable convolutions target model training with large batch sizes, where many samples are processed at once. Such approaches are poorly suited to small-batch training and to the typical inference scenario, where the model takes in only a few samples at a time. This article bridges this gap by optimizing depthwise separable convolutions for the GPU architecture. We design two novel algorithms that improve column and row reuse in the convolution operation, reducing the number of memory operations performed along the width and height dimensions of the 2D convolution. Our approach employs a dynamic tile size scheme to adaptively distribute the computation across GPU threads, improving GPU utilization and hiding memory access latency. We evaluate our approach on two GPU platforms, an NVIDIA RTX 2080Ti GPU and an embedded NVIDIA Jetson AGX Xavier GPU, and two data types, 32-bit floating point (FP32) and 8-bit integer (INT8), comparing against cuDNN, which is heavily tuned for the NVIDIA GPU architecture. Experimental results show that our approach delivers over 2x (up to 3x) performance improvement over cuDNN. When using a moderate batch size, our approach reduces the end-to-end training time of MobileNet and EfficientNet by 9.7 and 7.3 percent on average, respectively, and reduces their end-to-end inference time by 12.2 and 11.6 percent, respectively.
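The factorization the abstract refers to can be sketched as follows: a depthwise separable convolution replaces one KxK multi-channel convolution with a per-channel (depthwise) KxK convolution followed by a 1x1 (pointwise) convolution, cutting the per-pixel cost from K*K*C_in*C_out multiplies to K*K*C_in + C_in*C_out. This is a minimal NumPy illustration of the operation itself, not the paper's GPU implementation; the function name and tensor layout are chosen here for clarity.

```python
import numpy as np

def depthwise_separable_conv2d(x, dw_filters, pw_filters):
    """Depthwise separable convolution, 'valid' padding for brevity.

    x:          (C_in, H, W)   input feature map
    dw_filters: (C_in, K, K)   one spatial filter per input channel
    pw_filters: (C_out, C_in)  1x1 filters that mix channels
    Returns:    (C_out, H-K+1, W-K+1)
    """
    c_in, h, w = x.shape
    _, k, _ = dw_filters.shape
    oh, ow = h - k + 1, w - k + 1

    # Depthwise stage: each channel convolved independently with its
    # own KxK filter; no cross-channel mixing happens here.
    dw_out = np.empty((c_in, oh, ow))
    for c in range(c_in):
        for i in range(oh):
            for j in range(ow):
                dw_out[c, i, j] = np.sum(x[c, i:i + k, j:j + k]
                                         * dw_filters[c])

    # Pointwise stage: a 1x1 convolution mixes channels at each pixel.
    return np.einsum('oc,chw->ohw', pw_filters, dw_out)
```

The two loops over the output rows and columns in the depthwise stage are exactly where the paper's row- and column-reuse algorithms apply: adjacent output pixels read overlapping KxK input windows, so a tuned GPU kernel keeps those overlapping values in registers instead of re-fetching them from memory.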
Pages: 70-87
Number of pages: 18
Related papers (50 records)
  • [1] Mobile-X: Dedicated FPGA Implementation of the MobileNet Accelerator Optimizing Depthwise Separable Convolution
    Hong, Hyeonseok
    Choi, Dahun
    Kim, Namjoon
    Kim, Hyun
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2024, 71 (11) : 4668 - 4672
  • [2] Optimizing GPU Memory Transactions for Convolution Operations
    Lu, Gangzhao
    Zhang, Weizhe
    Wang, Zheng
    2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020), 2020, : 399 - 403
  • [3] Improved Depthwise Separable Convolution for Transfer Learning in Fault Diagnosis
    Xu, Hai
    Xiao, Yongchang
    Sun, Kun
    Cui, Lingli
    IEEE SENSORS JOURNAL, 2024, 24 (20) : 33606 - 33613
  • [4] Optimizing Batched Winograd Convolution on GPUs
    Yan, Da
    Wang, Wei
    Chu, Xiaowen
    PROCEEDINGS OF THE 25TH ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING (PPOPP '20), 2020, : 32 - 44
  • [5] An FPGA-Based Approach for Compressing and Accelerating Depthwise Separable Convolution
    Yang, Ruiheng
    Chen, Zhikun
    Hu, Lingtong
    Cui, Xihang
    Guo, Yunfei
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 2590 - 2594
  • [6] CSDS: End-to-End Aerial Scenes Classification With Depthwise Separable Convolution and an Attention Mechanism
    Wang, Xinyu
    Yuan, Liming
    Xu, Haixia
    Wen, Xianbin
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 (14) : 10484 - 10499
  • [7] Low-Power Hardware Architecture for Depthwise Separable Convolution Unit Design
    Lin, Shi-Rou
    Lin, Wei-Hung
    Huang, Shih-Hsu
    Hsu, Chun-Lung
    Sun, Chitien
    2020 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS - TAIWAN (ICCE-TAIWAN), 2020,
  • [8] Shallow Network Based on Depthwise Overparameterized Convolution for Hyperspectral Image Classification
    Gao, Hongmin
    Chen, Zhonghao
    Li, Chenming
    IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, 2022, 19
  • [9] Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs
    Chen, Xiaoming
    Chen, Jianxu
    Chen, Danny Z.
    Hu, Xiaobo Sharon
    PROCEEDINGS OF THE 2017 54TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2017,
  • [10] Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels
    Jia, Liancheng
    Liang, Yun
    Li, Xiuhong
    Lu, Liqiang
    Yan, Shengen
    IEEE TRANSACTIONS ON COMPUTERS, 2020, 69 (07) : 986 - 997