Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme

Cited by: 29
Authors
Moses, William S. [1]
Churavy, Valentin [1]
Paehler, Ludger [2]
Hueckelheim, Jan [3]
Narayanan, Sri Hari Krishna [3]
Schanen, Michel [3]
Doerfert, Johannes [3]
Affiliations
[1] MIT, CSAIL, Cambridge, MA 02139 USA
[2] Tech Univ Munich, Munich, Germany
[3] Argonne Natl Lab, Lemont, IL USA
Source
SC21: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2021
Keywords
Automatic Differentiation; AD; CUDA; ROCm; GPU; LLVM; HPC; ADJOINT; ALGORITHM; COMPILER;
DOI
10.1145/3458817.3476165
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Computing derivatives is key to many algorithms in scientific computing and machine learning, such as optimization, uncertainty quantification, and stability analysis. Enzyme is an LLVM compiler plugin that performs reverse-mode automatic differentiation (AD) and thus generates high-performance gradients of programs in languages including C/C++, Fortran, Julia, and Rust. Prior to this work, Enzyme and other AD tools were not capable of generating gradients of GPU kernels. Our paper presents a combination of novel techniques that make Enzyme the first fully automatic reverse-mode AD tool to generate gradients of GPU kernels. Since, unlike other tools, Enzyme performs automatic differentiation within a general-purpose compiler, we are able to introduce several novel GPU- and AD-specific optimizations. To show the generality and efficiency of our approach, we compute gradients of five GPU-based HPC applications, executed on NVIDIA and AMD GPUs. All benchmarks run within an order of magnitude of the original program's execution time. Without GPU- and AD-specific optimizations, gradients of GPU kernels either fail to run from a lack of resources or have infeasible overhead. Finally, we demonstrate that increasing the problem size, by either increasing the number of threads or increasing the work per thread, does not substantially impact the overhead from differentiation.
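
To make the interface concrete, the sketch below illustrates how a reverse-mode gradient of a simple CUDA device function might be requested through Enzyme's __enzyme_autodiff entry point, following the duplicated-argument (shadow) convention the tool documents. The names inner, grad_kernel, d_x, and d_y, the launch configuration, and the use of managed memory are illustrative assumptions rather than code from the paper; the example also presumes a Clang/LLVM toolchain with the Enzyme plugin loaded so the call is rewritten into generated gradient code at compile time.

// Minimal sketch (assumed example): reverse-mode gradient of
// y[i] = x[i] * x[i] computed on the GPU via Enzyme.
#include <cstdio>
#include <cuda_runtime.h>

// Primal device function to be differentiated.
__device__ void inner(double* x, double* y) {
    int i = threadIdx.x;
    y[i] = x[i] * x[i];
}

// Declared but never defined: Enzyme's compiler pass replaces calls to
// this symbol with synthesized reverse-mode gradient code.
__device__ void __enzyme_autodiff(void*, ...);

// Each floating-point pointer argument is followed by its shadow:
// d_y carries the incoming adjoint of y, d_x accumulates the result.
__global__ void grad_kernel(double* x, double* d_x, double* y, double* d_y) {
    __enzyme_autodiff((void*)inner, x, d_x, y, d_y);
}

int main() {
    const int n = 32;
    double *x, *d_x, *y, *d_y;
    cudaMallocManaged(&x,   n * sizeof(double));
    cudaMallocManaged(&d_x, n * sizeof(double));
    cudaMallocManaged(&y,   n * sizeof(double));
    cudaMallocManaged(&d_y, n * sizeof(double));
    for (int i = 0; i < n; ++i) {
        x[i]   = i;
        d_x[i] = 0.0;   // gradient accumulator, zero-initialized
        y[i]   = 0.0;
        d_y[i] = 1.0;   // seed: adjoint of each output element is 1
    }

    grad_kernel<<<1, n>>>(x, d_x, y, d_y);
    cudaDeviceSynchronize();

    // d/dx (x*x) = 2x, so d_x[3] should be 6 for x[3] = 3.
    printf("d_x[3] = %f\n", d_x[3]);

    cudaFree(x); cudaFree(d_x); cudaFree(y); cudaFree(d_y);
    return 0;
}

Without the Enzyme pass, __enzyme_autodiff remains an undefined symbol, so the program only links when the plugin is active; explicit activity annotations (e.g., enzyme_dup) can be added if the default argument inference is not desired.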
Pages: 18