Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions Atomically

Cited by: 1
Authors
Zhao, Jie [1 ]
Bastoul, Cedric [2 ]
Yi, Yanzhi [3 ]
Hu, Jiahui [3 ]
Nie, Wang [3 ]
Zhang, Renwei [3 ]
Geng, Zhen [3 ]
Li, Chong [2 ]
Tachon, Thibaut [2 ]
Gan, Zhiliang [3 ]
Affiliations
[1] State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou, People's Republic of China
[2] Huawei Technologies France SASU, Paris, France
[3] Huawei Technologies Co., Ltd., Beijing, People's Republic of China
Source
Proceedings of the 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022), 2022
Funding
National Natural Science Foundation of China
Keywords
deep learning; reduction; GPU; polyhedral compilation
DOI
10.1145/3559009.3569656
Chinese Library Classification
TP3 [Computing Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Because they lack a well-orchestrated sequence of loop transformations, existing optimizing compilers for deploying neural networks on GPUs either parallelize reductions ineffectively or miss opportunities to fuse them with other operators; neural network models therefore exhibit sub-optimal performance on GPUs. We present Panamera, a practical approach for the effective parallelization of reductions in neural networks on GPU. Panamera first leverages loop coalescing to flatten the loop dimensions of reductions, converting all reduction operators into canonical forms amenable to the polyhedral model. Next, it applies polyhedral transformations to reduce the data movement caused by unfused reductions and to perform multi-block hardware binding, which many compilers do not consider. Finally, Panamera embeds a highly optimized routine implemented with GPU atomic instructions, further improving the performance of neural network models while guaranteeing the correctness of parallel reductions. Experimental results demonstrate the effectiveness of our approach: for single operators, our code obtains mean speedups of 33.7x, 3.5x, 5.4x, and 9.6x over cuDNN, CUB, TVM, and Ansor, respectively; for sub-graphs, it outperforms cuDNN, TVM, and Ansor by 9.5x, 2.6x, and 2.7x; and for end-to-end workloads, a tensor compiler integrated with our approach outperforms them by 122.5%, 19.3%, and 15.2%.
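The core idea the abstract describes, letting many thread blocks cooperate on one reduction and committing their partial results with GPU atomic instructions rather than an extra kernel launch or an unfused epilogue, can be illustrated with a minimal CUDA sketch. This is a generic illustration, not the routine Panamera embeds: the kernel name, variable names, and launch parameters below are hypothetical choices for the example. Each block reduces a flattened 1-D range in shared memory and then merges its partial sum into the global result with a single atomicAdd.

```cuda
// Minimal multi-block sum reduction finalized with atomics (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void atomic_sum_reduce(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Grid-stride loop over the flattened (coalesced) 1-D iteration space.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local += in[i];
    sdata[tid] = local;
    __syncthreads();

    // Tree reduction within the block's shared memory (blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block merges the partial result into the global sum,
    // so no second kernel launch is needed to combine block results.
    if (tid == 0)
        atomicAdd(out, sdata[0]);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    const int threads = 256;
    const int blocks = 128;  // multiple blocks bound to a single reduction
    atomic_sum_reduce<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("sum = %f (expected %d)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The atomic finalization keeps the reduction in one kernel, which is what makes it possible to bind more than one block to the reduction dimension while still guaranteeing a correct result; the trade-off is that floating-point accumulation order becomes non-deterministic across runs.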
Pages: 451-466
Page count: 16