Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks

Cited by: 0

Authors:

Tsai, Yaohung M. [1]

Luszczek, Piotr [1]

Kurzak, Jakub [1]

Dongarra, Jack [1,2]

Affiliations:

[1] Univ Tennessee Knoxville, Knoxville, TN 37996 USA

[2] Oak Ridge Natl Lab, Oak Ridge, TN USA

Source:

PROCEEDINGS OF 2016 2ND WORKSHOP ON MACHINE LEARNING IN HPC ENVIRONMENTS (MLHPC) | 2016

Funding:

US National Science Foundation

Keywords:

DOI:

10.1109/MLHPC.2016.5

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning as well as data layout and low-level optimizations to achieve performance matching or exceeding what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries. The former was done inside the maxDNN implementation and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients and getting stuck in local minima. Performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack show that our methodology can match server-grade hardware at a fraction of the price. Another tuning sweep on a new GPU architecture from a different vendor also attests to the portability of our approach and the quality of our implementation.


Pages: 9 - 18

Number of pages: 10

50 items in total

[1] Developing Performance-Portable Molecular Dynamics Kernels in OpenCL
Pennycook, S. J.
Jarvis, S. A.
2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 386 - 395
[2] pocl: A Performance-Portable OpenCL Implementation
Jaaskelainen, Pekka
Sanchez de La Lama, Carlos
Schnetter, Erik
Raiskila, Kalle
Takala, Jarmo
Berg, Heikki
INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2015, 43 (05) : 752 - 785
[3] pocl: A Performance-Portable OpenCL Implementation
Pekka Jääskeläinen
Carlos Sánchez de La Lama
Erik Schnetter
Kalle Raiskila
Jarmo Takala
Heikki Berg
International Journal of Parallel Programming, 2015, 43 : 752 - 785
[4] Autotuning techniques for performance-portable point set registration in 3D
Luszczek P.
Kurzak J.
Yamazaki I.
Keffer D.
Maroulas V.
Dongarra J.
Supercomputing Frontiers and Innovations, 2018, 5 (04) : 42 - 61
[5] Towards performance-portable generation of tensor kernels for computational chemistry
Rajbhandari, Samyam
Stock, Kevin
Sadayappan, P.
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2014, 248
[6] Clustering Convolutional Kernels to Compress Deep Neural Networks
Son, Sanghyun
Nah, Seungjun
Lee, Kyoung Mu
COMPUTER VISION - ECCV 2018, PT VIII, 2018, 11212 : 225 - 240
[7] PPOpenCL: A Performance-Portable OpenCL Compiler with Host and Kernel Thread Code Fusion
Liu, Ying
Huang, Lei
Wu, Mingchuan
Cui, Huimin
Lv, Fang
Feng, Xiaobing
Xue, Jingling
PROCEEDINGS OF THE 28TH INTERNATIONAL CONFERENCE ON COMPILER CONSTRUCTION (CC '19), 2019, : 2 - 16
[8] DualConv: Dual Convolutional Kernels for Lightweight Deep Neural Networks
Zhong, Jiachen
Chen, Junying
Mian, Ajmal
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (11) : 9528 - 9535
[9] A study on the uncertainty of convolutional layers in deep neural networks
Haojing Shen
Sihong Chen
Ran Wang
International Journal of Machine Learning and Cybernetics, 2021, 12 : 1853 - 1865
[10] A study on the uncertainty of convolutional layers in deep neural networks
Shen, Haojing
Chen, Sihong
Wang, Ran
INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2021, 12 (06) : 1853 - 1865
