Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks

Cited by: 0
Authors
Tsai, Yaohung M. [1]
Luszczek, Piotr [1]
Kurzak, Jakub [1]
Dongarra, Jack [1, 2]
Affiliations
[1] Univ Tennessee Knoxville, Knoxville, TN 37996 USA
[2] Oak Ridge Natl Lab, Oak Ridge, TN USA
Source
PROCEEDINGS OF 2016 2ND WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC) | 2016
Funding
U.S. National Science Foundation
DOI
10.1109/MLHPC.2016.5
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning as well as data layout and low-level optimizations, achieving performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries. The former was done inside the maxDNN implementation and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients and getting stuck in local minima. In performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack, our methodology can match server-grade hardware at a fraction of the price. Another tuning sweep on a new GPU architecture from a different vendor also attests to the portability of our approach and the quality of our implementation.
Pages: 9-18
Page count: 10