Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels

Cited by: 15
Authors
Jia, Liancheng [1 ]
Liang, Yun [1 ]
Li, Xiuhong [1 ]
Lu, Liqiang [1 ]
Yan, Shengen [2 ]
Affiliations
[1] Peking Univ, Ctr Energy Efficient Comp & Applicat, Beijing 100871, Peoples R China
[2] Sensetime Grp, Hong Kong, Peoples R China
Keywords
Kernel; Convolution; Task analysis; Graphics processing units; Tensile stress; Instruction sets; Libraries
DOI
10.1109/TC.2020.2973144
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Modern Convolutional Neural Networks (CNNs) require a massive number of convolution operations. To address this overwhelming computational cost, the Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are transformed into special domains and then multiplied element-wise, which can be expressed as a batched GEMM operation. The different stages of the computation contain multiple tasks with distinct computation and memory behaviors, and these tasks share intermediate data, which provides the opportunity to fuse them into a monolithic kernel. However, traditional kernel fusion suffers from insufficient shared memory, which limits performance. In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernels. GPU thread blocks are assigned different computation tasks, and we design a mapping algorithm to assign tasks to thread blocks. We build a scheduler that fetches and executes tasks following their dependency relationships. Evaluation on modern CNNs shows that our techniques achieve average speedups of 1.25X and 1.7X over cuDNN's two implementations of the Winograd convolution algorithm.
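As a rough illustration of the multiplication savings the abstract refers to, the 1-D Winograd transform F(2,3) computes two outputs of a 3-tap filter with 4 multiplications instead of the naive 6. The sketch below is a textbook minimal example, not code from the paper; the paper's actual kernels operate on 2-D tiles and batched GEMM on the GPU.

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter with 4 multiplies.

    Illustrative sketch only (not the paper's implementation).
    d: 4 consecutive input samples, g: 3 filter taps.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Four element-wise products in the transformed domain; the filter
    # terms (g0+g1+g2)/2 etc. can be precomputed once per filter.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Inverse transform: combine the 4 products into 2 outputs.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The result matches direct correlation: output 0 is `d0*g0 + d1*g1 + d2*g2` and output 1 is `d1*g0 + d2*g1 + d3*g2`, so the transform trades multiplications for cheap additions.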
Pages: 986-997
Page count: 12
Related Papers
50 records in total
  • [1] Fast Algorithms for Knapsack via Convolution and Prediction. Bateni, MohammadHossein; Hajiaghayi, MohammadTaghi; Seddighin, Saeed; Stein, Cliff. STOC'18: PROCEEDINGS OF THE 50TH ANNUAL ACM SIGACT SYMPOSIUM ON THEORY OF COMPUTING, 2018: 1269-1282
  • [2] Enabling Fast Preemption via Dual-Kernel Support on GPUs. Shieh, Li-Wei; Chen, Kun-Chih; Fu, Hsueh-Chun; Wang, Po-Han; Yang, Chia-Lin. 2017 22ND ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE (ASP-DAC), 2017: 121-126
  • [3] FLEP: Enabling Flexible and Efficient Preemption on GPUs. Wu, Bo; Liu, Xu; Zhou, Xiaobo; Jiang, Changjun. TWENTY-SECOND INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS XXII), 2017: 483-496
  • [4] FLEP: Enabling Flexible and Efficient Preemption on GPUs. Wu, Bo; Liu, Xu; Zhou, Xiaobo; Jiang, Changjun. ACM SIGPLAN NOTICES, 2017, 52(04): 483-496
  • [5] FLEP: Enabling Flexible and Efficient Preemption on GPUs. Wu, Bo; Liu, Xu; Zhou, Xiaobo; Jiang, Changjun. OPERATING SYSTEMS REVIEW, 2017, 51(02): 483-496
  • [6] Adaptation of Algorithms for efficient execution on GPUs. Bulavintsev, Vadim G.; Zhdanov, Dmitry D. OPTICAL DESIGN AND TESTING XI, 2021, 11895
  • [7] Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs. Jorda, Marc; Valero-Lara, Pedro; Pena, Antonio J. IEEE ACCESS, 2019, 7: 70461-70473
  • [8] Fast Parallel Connected Components Algorithms on GPUs. Cong, Guojing; Muzio, Paul. EURO-PAR 2014: PARALLEL PROCESSING WORKSHOPS, PT I, 2014, 8805: 153-164
  • [9] FAST ALGORITHMS FOR THE MAXIMUM CONVOLUTION PROBLEM. BUSSIECK, M; HASSLER, H; WOEGINGER, GJ; ZIMMERMANN, UT. OPERATIONS RESEARCH LETTERS, 1994, 15(03): 133-141
  • [10] Designing Efficient Sorting Algorithms for Manycore GPUs. Satish, Nadathur; Harris, Mark; Garland, Michael. 2009 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-5, 2009: 257+